IMAGE PROCESSING METHOD AND APPARATUS (ART01)

Field of the Invention 

The present invention relates to an image processing 
method and apparatus and, in particular, discloses a Digital 
Instant Camera with Image Processing Capability. 

The present invention further relates to the field of 
digital camera technology and, particularly, discloses a 
digital camera having an integral colour printer. 
Background of the Invention 

Traditional camera technology has for many years relied
upon the provision of an optical processing system which
images a negative of an image onto a photosensitive film
which is subsequently chemically processed so as to "fix"
the film and to subsequently allow for positive prints to be
produced which reproduce the original image. Such an image
processing technology, although it has become a standard,
can be unduly complex, as expensive and difficult
technologies are involved in full colour processing of
images. Recently, digital cameras have become available.
These cameras normally rely upon the utilisation of a
charge coupled device (CCD) to image the particular image.
The camera normally includes storage media for the
storage of the captured images in addition to a connector for
the transfer of images to a subsequent computer device for
manipulation and printing out.

Such devices are generally inconvenient in that all 
images must be stored by the camera and printed out at some 
later stage. Hence, the camera must have sufficient storage 
capabilities for the storing of multiple images and, 
additionally, the user of the camera must have access to a 
subsequent computer system for the downloading of the images 
and their printing out by a computer device or the like.
Summary of the Invention 

The present invention relates to providing an 
alternative form of camera system which includes a digital 
camera with an integral colour printer. Additionally, the
camera provides hardware and software for the increasing of
the apparent resolution of the image sensing system and the
conversion of the image to a wide range of "artistic styles"
and a graphic enhancement.

In accordance with a first aspect of the present 
invention, there is provided a camera system comprising at 
least one area image sensor for imaging a scene, a camera 
processor means for processing said image scene in 
accordance with a predetermined scene transformation 
requirement, a printer for printing out said processed image
scene on print media, said printer, print media and printing
ink stored in a single detachable module inside said camera
system, said camera system comprising a portable hand held
unit for the imaging of scenes by said area image sensor and 
printing said scenes directly out of said camera system via 
said printer. 

Preferably the camera system includes a print roll for
the storage of print media and printing ink for utilisation
by the printer, the said print roll being detachable from
the camera system. Further, the print roll can include an
authentication chip containing authentication information 
and the camera processing means is adapted to interrogate 
the authentication chip so as to determine the authenticity 
of said print roll when inserted within said camera system. 

Further, the printer can include a drop on demand ink
printer and guillotine means for the separation of printed
photographs.

Brief Description of the Drawings 

Notwithstanding any other forms which may fall within 
the scope of the present invention, preferred forms of the 
invention will now be described, by way of example only, 
with reference to the accompanying drawings in which: 
Fig. 1 illustrates an Artcam device constructed in
accordance with the preferred embodiment; 

Fig. 2 is a schematic block diagram of the main Artcam 
electronic components; 



Fig. 3 is a schematic block diagram of the Artcam Central 
Processor in more detail; 

Fig. 4 illustrates the CCD image organisation;
Fig. 5 illustrates the storage format for a logical image;
Fig. 6 illustrates the internal image memory storage format;
Fig. 7 illustrates the image pyramid storage format;
Fig. 8 illustrates the process steps in creating an output
image;
Fig. 9 illustrates the operation of an image iterator;
Fig. 10 illustrates an example read iterator;
Fig. 11 illustrates a standard process; 
Fig. 12 illustrates an Iterator workload; 

Fig. 13 illustrates a first example box read iterator
output;

Fig. 14 illustrates a second example box read iterator
output;

Fig. 15 illustrates a Box Read Iterator Process; 

Fig. 16 illustrates the storage format utilised by a
vertical strip Iterator;

Fig. 17 illustrates a process that requires only a vertical 
strip Write Iterator; 

Fig. 18 illustrates the VLIW processor architecture; 

Fig. 19 illustrates the I/O units block in more detail; 

Fig. 20 illustrates the process of generating a sequential
read;

Fig. 21 illustrates the internal portion of the sequential
coordinate generator;

Fig. 22 illustrates the vertical strip generation process; 
Fig. 23 illustrates an implementation of the vertical strip 
generation process; 

Fig. 24 illustrates the form of a single CCD pixel;

Fig. 25 illustrates the CCD reading process; 

Fig. 26 illustrates the process of sampling an artcard; 

Fig. 27 illustrates the process of reading a rotated
Artcard;

Fig. 28 illustrates a flow chart of the steps necessary to
decode an Artcard;

Fig. 29 illustrates a timeline of pixel reading of an 
Artcard; 

Fig. 30 illustrates an enlargement of the left hand corner 
of a single Artcard; 

Fig. 31 illustrates a single target for detection; 
Fig. 32 illustrates the method utilised to detect targets; 
Fig. 33 illustrates the method of calculating the distance 
between two targets; 

Fig. 34 illustrates the process of centroid drift; 
Fig. 35 shows one form of centroid lookup table; 
Fig. 36 illustrates the centroid updating process; 
Fig. 37 illustrates a delta processing lookup table utilised 
in the preferred embodiment; 

Fig. 38 illustrates the process of unscrambling Artcard
data;

Fig. 39 illustrates the convolution process;

Fig. 40 illustrates one form of implementation of the
convolver;

Fig. 41 illustrates the compositing process;

Fig. 42 illustrates the regular compositing process in more
detail;

Fig. 43 illustrates the process of warping using a warp map;
Fig. 44 illustrates the warping bi-linear interpolation
process;

Fig. 45 illustrates the process of span calculation;
Fig. 46 illustrates the basic span calculation process;
Fig. 47 illustrates one form of detail implementation of the
span calculation process;

Fig. 48 illustrates the process of reading image pyramid
levels;

Fig. 49 illustrates using the pyramid table for bilinear
interpolation;

Fig. 50 illustrates the histogram collection process;
Fig. 51 illustrates the color transform process;
Fig. 52 illustrates the color conversion process;



Fig. 53 illustrates the color space conversion process in
more detail;
Fig. 54 illustrates the process of calculating an input
coordinate;
Fig. 55 illustrates the basic process for calculating a
pixel;
Fig. 56 illustrates the generalized scaling process;
Fig. 57 illustrates the scale in X scaling process;
Fig. 58 illustrates the scale in Y scaling process;
Fig. 59 illustrates the tessellation process;
Fig. 60 illustrates the sub-pixel translation process;
Fig. 61 illustrates the compositing process;

Fig. 62 illustrates the process of compositing with
feedback;

Fig. 63 illustrates the process of tiling with color from
the input image;

Fig. 64 illustrates the process of tiling with feedback;
Fig. 65 illustrates the process of tiling with texture
replacement;

Fig. 66 illustrates the process of tiling with background
and tile texture;

Fig. 67 illustrates the process of applying a texture
without feedback;

Fig. 68 illustrates the process of applying a texture with
feedback;

Fig. 69 illustrates the process of rotation of CCD pixels;
Fig. 70 illustrates the process of interpolation of Green
subpixels;

Fig. 71 illustrates the process of interpolation of Blue
subpixels;

Fig. 72 illustrates the process of interpolation of Red
subpixels;

Fig. 73 illustrates the process of CCD pixel interpolation
with 0 degree rotation for odd pixel lines;

Fig. 74 illustrates the process of CCD pixel interpolation
with 0 degree rotation for even pixel lines;
Fig. 75 illustrates the process of color conversion to Lab
color space;

Fig. 76 illustrates the process of calculation of 1/√X;
Fig. 77 illustrates the implementation of the calculation of
1/√X in more detail;

Fig. 78 illustrates the process of Normal calculation with a 
bump map; 

Fig. 79 illustrates the process of illumination calculation
with a bump map;

Fig. 80 illustrates the process of illumination calculation
with a bump map in more detail;

Fig. 81 illustrates the process of calculation of L using a
directional light;

Fig. 82 illustrates the process of calculation of L using
Omni lights and spotlights;

Fig. 83 illustrates one form of implementation of
calculation of L using Omni lights and spotlights;
Fig. 84 illustrates the process of calculating the N.L dot
product;

Fig. 85 illustrates the process of calculating the N.L dot
product in more detail;

Fig. 86 illustrates the process of calculating the R.V dot
product;

Fig. 87 illustrates the process of calculating the R.V dot
product in more detail;

Fig. 88 illustrates the attenuation inputs and outputs;

Fig. 89 illustrates an actual implementation of attenuation
calculation;

Fig. 90 illustrates a graph of the cone factor;
Fig. 91 illustrates the process of penumbra calculation;

Fig. 92 illustrates the angles utilised in penumbra
calculation;

Fig. 93 illustrates the inputs and outputs to penumbra
calculation;

Fig. 94 illustrates an actual implementation of penumbra
calculation;



Fig. 95 illustrates the inputs and outputs to ambient
calculation;
Fig. 96 illustrates an actual implementation of ambient
calculation;
Fig. 97 illustrates an actual implementation of diffuse
calculation;
Fig. 98 illustrates the inputs and outputs to a diffuse
calculation;
Fig. 99 illustrates an actual implementation of a diffuse
calculation;

Fig. 100 illustrates the inputs and outputs to a specular
calculation;
Fig. 101 illustrates an actual implementation of a specular
calculation;
Fig. 102 illustrates the inputs and outputs to a specular
calculation;

Fig. 103 illustrates an actual implementation of a specular
calculation;

Fig. 104 illustrates an actual implementation of an ambient
only calculation;

Fig. 105 illustrates the process overview of light
calculation;

Fig. 106 illustrates an example illumination calculation for
a single infinite light source;

Fig. 107 illustrates an example illumination calculation for
an Omni light source without a bump map;

Fig. 108 illustrates an example illumination calculation for
an Omni light source with a bump map;

Fig. 109 illustrates an example illumination calculation for
a Spotlight light source without a bump map;

Fig. 110 illustrates the process of applying a single
Spotlight onto an image with an associated bump-map;

Fig. 111 illustrates the logical layout of a single
printhead;

Fig. 112 illustrates the structure of the printhead
interface;



Fig. 113 illustrates the process of rotation of an Lab 
image; 

Fig. 114 illustrates the format of a printed image;
Fig. 115 illustrates the dithering process;

Fig. 116 illustrates the process of generating an 8 bit dot
output;

Fig. 117 illustrates a card reader;

Fig. 118 illustrates an exploded perspective of a card
reader;

Fig. 119 illustrates a closeup view of the Artcard reader;
Fig. 120 illustrates a perspective view of the print roll
and print head;

Fig. 121 illustrates a first exploded perspective view of
the print roll;

Fig. 122 illustrates a second exploded perspective view of
the print roll;

Fig. 123 illustrates the print roll authentication chip;
Fig. 124 illustrates an enlarged view of the print roll
authentication chip;

Fig. 125 illustrates the architecture of the print roll
authentication chip;

Fig. 126 sets out the information stored on the print roll
authentication chip;

Fig. 127 illustrates the authentication process upon
insertion of a print roll;

Fig. 128 illustrates a shielding metal layer on top
of the authentication chip;

Fig. 129 illustrates the data stored within the Artcam 
authorisation chip; 

Fig. 130 illustrates the process of print head pulse 
characterisation; 

Fig. 131 is an exploded perspective, in section, of the 
print head ink supply mechanism; 

Fig. 132 is a bottom perspective of the ink head supply 
unit; 

Fig. 133 is a bottom side sectional view of the ink head
supply unit;

Fig. 134 is a top perspective of the ink head supply unit;
Fig. 135 is a top side sectional view of the ink head supply
unit;

Fig. 136 illustrates a wire frame view of a small portion of
the print head;

Fig. 137 is an exploded perspective of the
print head unit;

Fig. 138 illustrates a top side perspective view of the
internal portions of an Artcam camera, showing the parts
flattened out;

Fig. 139 illustrates a bottom side perspective view of the
internal portions of an Artcam camera, showing the parts
flattened out;

Fig. 140 illustrates a first top side perspective view of
the internal portions of an Artcam camera, showing the parts
as encased in an Artcam;

Fig. 141 illustrates a second top side perspective view of
the internal portions of an Artcam camera, showing the parts
as encased in an Artcam;

Fig. 142 illustrates a second top side perspective view of
the internal portions of an Artcam camera, showing the parts
as encased in an Artcam;

Fig. 143 illustrates the structure of the ALUs block;

Fig. 144 illustrates the structure of the read unit;

Fig. 145 illustrates the structure of the write unit;

Fig. 146 illustrates the structure of the ReadWrite unit;

Fig. 147 illustrates the structure of the Adder ALU;

Fig. 148 illustrates the structure of the Multiply ALU;

Fig. 149 illustrates the structure of the Logical ALU;

Fig. 150 illustrates the structure of the Display
Controller;

Fig. 151 illustrates the backing portion of a postcard print
roll;

Fig. 152 illustrates the corresponding front image on the
postcard print roll after printing out images;

Fig. 153 illustrates a form of print roll ready for purchase
by a consumer.



Description of preferred and other Embodiments 

The digital image processing camera system constructed
in accordance with the preferred embodiment is as
illustrated in Fig. 1. A digital imaging camera 1 is
provided and includes means for the insertion of an integral
print roll (not shown). The camera unit 1 can include an
area image sensor 2 which senses an image 3 for capture by
the camera. Optionally, a second area image sensor 4 can
be provided to also image the scene 3 and to optionally
provide for the production of stereographic output effects.

The camera 1 can include an optional colour display 5
for the display of the image being sensed by the sensor 2.
When a simple image is being displayed on the display 5, the
button 6 can be depressed resulting in the printed image 8
being output by the camera unit 1. A series of cards,
hereinafter known as "artcards" 9, contain, on one
surface, encoded information and, on the other surface,
an image distorted by the particular effect
produced by the artcard 9. The artcard 9 is inserted in an
artcard reader 10 in the back of camera 1 and, upon
insertion, results in the output image 8 being distorted in
the same manner as the distortion appearing on the surface
of the artcard 9. Hence, a user wishing to produce a
particular effect can insert one of many artcards 9 into the
artcard reader 10 and utilise button 6 to take a picture of
the image 3 resulting in a corresponding distorted output
image 8.

The camera unit 1 can also include a number of other
control buttons 13, 14 in addition to a simple LCD output
display 15 for the display of information
including the number of printouts left on the internal print
roll on the camera unit.

Turning now to Fig. 2, there is illustrated a schematic
view of the internal hardware of the camera unit 1. The
internal hardware is based around an Artcam central
processor unit (ACP) 31.
Artcam Central Processor 31 

The Artcam central processor 31 provides many functions
which form the 'heart' of the system. The ACP 31 is
preferably implemented as a complex, high speed, CMOS
system-on-a-chip. Utilising standard cell design with some
full custom regions is recommended. Fabrication on a 0.25µ
CMOS process will provide the density and speed required,
along with a reasonably small die area.

The functions provided by the ACP 31 include:

1. Control and digitisation of the area image sensor 2.
A 3D stereoscopic version of the ACP requires two area
image sensor interfaces with a second optional image sensor
4 being provided for stereoscopic effects.

2. Area image sensor compensation, reformatting, and
image enhancement.

3. Memory interface and management to a memory store 33.

4. Interface, control, and analog to digital
conversion of an Artcard reader linear image sensor 34 which
is provided for the reading of data from the artcards 9.

5. Extraction of the raw Artcard data from the
digitised and encoded Artcard image.

6. Reed-Solomon error detection and correction of the
Artcard encoded data. The encoded surface of the artcard 9
includes information on how to process an image to produce
the effects displayed on the image distorted surface of the
artcard 9. This information is in the form of a script,
hereinafter known as a "Vark script". The Vark script is
utilised by an interpreter running within the ACP 31 to
produce the desired effect.

7. Interpretation of the Vark script on the Artcard 9.

8. Performing image processing operations as
specified by the Vark script.

9. Controlling various motors for the paper transport
36, zoom lens 38, autofocus 39 and Artcard driver 37.

10. Controlling a guillotine actuator 40 for the
operation of a guillotine 41 for the cutting of photographs
8 from print roll 42.

11. Half-toning of the image data for printing.

12. Providing the print data to a printhead 44 at the
appropriate times.

13. Controlling the print head 44.

14. Controlling the ink pressure feed to printhead 44.

15. Controlling optional flash unit 56.

16. Reading and acting on various sensors in the
camera, including camera orientation sensor 46, autofocus 47
and Artcard insertion sensor 49.

17. Reading and acting on the user interface buttons
6, 13, 14.

18. Controlling the status display 15.

19. Providing viewfinder and preview images to the
colour display 5.

20. Control of the system power consumption, including
the ACP power consumption via power management circuit 51.

21. Providing external communications 52 to general
purpose computers (using USB).

22. Reading and storing information in a print roll
authentication chip 53.

23. Reading and storing information in a camera
authentication chip 54.

24. Communicating with an optional mini-keyboard 57
for text modification.

Quartz crystal 58 

A quartz crystal 58 is used as a frequency reference
for the system clock. As the system clock frequency is very
high, the ACP 31 includes a phase locked loop clock circuit
to increase the frequency derived from the crystal 58.
Image Sensing

Area image sensor 2 

The area image sensor 2 converts an image through its
lens into an electrical signal. It can either be a charge
coupled device (CCD) or an active pixel sensor (APS) CMOS
image sensor. At present, available CCDs normally have a
higher image quality; however, there is currently much
development occurring in CMOS imagers. CMOS imagers are
eventually expected to be substantially cheaper than CCDs,
have smaller pixel areas, and be able to incorporate drive
circuitry and signal processing. They can also be made in
CMOS fabs, which are transitioning to 12" wafers. CCDs are
usually built in 6" wafer fabs, and economics may not allow
a conversion to 12" fabs. Therefore, the difference in
fabrication cost between CCDs and CMOS imagers is likely to
increase, progressively favouring CMOS imagers. However, at
present, a CCD is probably the best option.

The Artcam unit will produce suitable results with a
1,500 x 1,000 area image sensor. However, smaller sensors,
such as 750 x 500, will be adequate for many markets. The
Artcam is less sensitive to image sensor resolution than are
conventional digital cameras. This is because many of the
styles contained on Artcards 9 process the image in such a
way as to obscure the lack of resolution. For example, if
the image is distorted to simulate the effect of being
converted to an impressionistic painting, low source image
resolution can be used with minimal effect. Further
examples for which low resolution input images will
typically not be noticed include image warps which produce
highly distorted images, multiple miniature copies of the
image (eg. passport photos), textural processing such as
bump mapping for a bas-relief metal look, and
photo-compositing into structured scenes.

This tolerance of low resolution image sensors may be a
significant factor in reducing the manufacturing cost of an
Artcam unit 1 camera. An Artcam with a low cost 750 x 500
image sensor will often produce superior results to a
conventional digital camera with a much more expensive 1,500
x 1,000 image sensor.

Optional stereoscopic 3D image sensor 4 

The 3D versions of the Artcam unit 1 have an additional
image sensor 4, for stereoscopic operation. This image
sensor is identical to the main image sensor. The circuitry
to drive the optional image sensor may be included as a
standard part of the ACP chip 31 to reduce incremental
design cost. Alternatively, a separate 3D Artcam ACP can be
designed. This option will reduce the manufacturing cost of
a mainstream single sensor Artcam.
Print roll authentication chip 53 

A small chip 53 is included in each print roll 42.
This chip replaces the functions of the bar code, optical
sensor and wheel, and ISO/ASA sensor on other forms of
camera film units such as Advanced Photo System film
cartridges.

The authentication chip also provides other features:

1. The storage of data rather than that which is
mechanically and optically sensed from APS rolls.

2. A remaining media length indication, accurate to
mm.

3. Authentication information to prevent inferior
copies.

The authentication chip 53 contains 1024 bits of Flash 
memory, of which 128 bits is an authentication key, and 512 
bits is the authentication information. Also included is an 
encryption circuit to ensure that the authentication key 
cannot be accessed directly. 
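
The memory budget described above can be tallied with a short Python sketch. Only the three sizes (1024, 128 and 512 bits) come from the description; the field names, and the treatment of the remaining bits as "unallocated", are assumptions made purely for illustration.

```python
# Sketch of the authentication chip's Flash budget as described in the text.
# Only the three sizes below are taken from the description; the region names
# and the "unallocated" remainder are hypothetical.
FLASH_BITS = 1024
AUTH_KEY_BITS = 128
AUTH_INFO_BITS = 512


def flash_budget():
    """Return the number of bits assigned to each region of the Flash memory."""
    remaining = FLASH_BITS - AUTH_KEY_BITS - AUTH_INFO_BITS
    return {
        "auth_key": AUTH_KEY_BITS,
        "auth_info": AUTH_INFO_BITS,
        "unallocated": remaining,
    }


budget = flash_budget()
for name, bits in budget.items():
    print(f"{name}: {bits} bits ({bits // 8} bytes)")
```

The description does not state how the bits outside the key and the authentication information are used, so the sketch leaves them unlabelled rather than guessing.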
Printhead 44 

The Artcam unit 1 can utilise any colour print 
technology which is small enough, low enough power, fast 
enough, high enough quality, and low enough cost, and is 
compatible with the print roll. Relevant printheads will be
specifically discussed hereinafter.

The specifications of the ink jet head are:

Image type           Bi-level, dithered
Colour               CMY Process Colour
Resolution           1600 dpi
Print head length    'Page-width' (100mm)
Print speed          2 seconds per photo

Optional ink pressure Controller (not shown) 

The function of the ink pressure controller depends
upon the type of ink jet print head 44 incorporated in the
Artcam. For some types of ink jet, the use of an ink pressure
controller can be eliminated, as the ink pressure is simply
atmospheric pressure. Other types of print head require a
regulated positive ink pressure. In this case, the ink
pressure controller consists of a pump and pressure
transducer.

Other print heads may require an ultrasonic transducer
to cause regular oscillations in the ink pressure, typically
at frequencies around 100kHz. In this case, the ACP 31
controls the frequency, phase and amplitude of these
oscillations.
Paper transport motor 36 

The paper transport motor 36 moves the paper from
within the print roll 42 past the print head at a relatively
constant rate. The motor 36 is a miniature motor geared
down to an appropriate speed to drive rollers which move the
paper. A high quality motor and mechanical gears are
required to achieve high image quality, as mechanical rumble
or other vibrations will affect printed dot row spacing.
Paper transport motor driver 60 

The motor driver 60 is a small circuit which amplifies
the digital motor control signals from the ACP 31 to levels
suitable for driving the motor 36.
Paper pull sensor

A paper pull sensor 50 detects a user's attempt to pull
a photo from the camera unit during the printing process.
The ACP 31 reads this sensor 50, and activates the
guillotine 41 if the condition occurs. The paper pull
sensor 50 is incorporated to make the camera more
'foolproof' in operation. Were the user to pull the paper
out forcefully during printing, the print mechanism 44 or
print roll 42 may (in extreme cases) be damaged. Since it
is acceptable to pull out the 'pod' from a Polaroid type
camera before it is fully ejected, the public has been
'trained' to do this. Therefore, they are unlikely to heed
printed instructions not to pull the paper.

The Artcam preferably restarts the photo print process 
after the guillotine 41 has cut the paper after pull 
sensing. 

The pull sensor can be implemented as a strain gauge
sensor, or as an optical sensor detecting a small plastic
flag which is deflected by the torque that occurs on the
paper drive rollers when the paper is pulled. The latter
implementation is recommended for low cost.
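
The pull handling described in this section (cut on a detected pull, then restart the photo) can be sketched as a small simulation. This is a minimal illustration only: the function name, the row-by-row model of printing and the step-indexed pull events are all assumptions, not the control logic of the ACP 31.

```python
def print_photo(total_rows, pull_events):
    """Simulate printing one photo with paper pull handling.

    total_rows  -- number of dot rows in one photo (hypothetical unit)
    pull_events -- set of step indices at which the pull sensor fires
    Returns a log of (action, row) tuples.
    """
    log = []
    step = 0
    row = 0
    while row < total_rows:
        if step in pull_events:
            # Pull detected: cut the paper at once, then restart the photo.
            log.append(("cut", row))
            log.append(("restart", 0))
            row = 0
        else:
            log.append(("print_row", row))
            row += 1
        step += 1
    # Normal completion: cut at the end of the photograph.
    log.append(("cut", total_rows))
    return log


log = print_photo(3, {1})
for entry in log:
    print(entry)
```

In this sketch a pull at step 1 produces an early cut after one printed row, after which the three rows are printed again from the start and cut normally.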
Paper guillotine actuator 

The paper guillotine actuator 40 is a small actuator
which causes the guillotine 41 to cut the paper either at
the end of a photograph, or when the paper pull sensor 50 is
activated.

Paper guillotine actuator driver 40

The guillotine actuator driver 40 is a small circuit
which amplifies a guillotine control signal from the ACP to
the level required by the actuator 41.
Artcard 9 

The Artcard 9 is a program storage medium for the 
Artcam unit. As noted previously, the programs are in the 
form of Vark scripts. Vark is a powerful image processing 
language especially developed for the Artcam unit. Each 
Artcard 9 contains one Vark script, and thereby defines one 
image processing style.

Preferably, the VARK language is highly image processing
specific. By being highly image processing specific, the
amount of storage required to store the details on the card
is substantially reduced. Further, the
ease with which new programs can be created, including
enhanced effects, is also substantially increased.
Preferably, the language includes facilities for handling
many image processing functions including image warping via
a warp map, convolution, color lookup tables, posterizing an
image, adding noise to an image, image enhancement filters,
painting algorithms, brush jittering and manipulation, edge
detection filters, tiling, illumination via light sources,
bump maps, text, face detection and object detection
attributes, fonts, including three dimensional fonts, and
arbitrary complexity pre-rendered icons.

Attached in Appendix D is an example of the VARK 
language which includes all of these facilities and has been 
defined by the present applicant with image processing 
functionality in mind. 

Hence, by utilizing the language constructs as defined
by the created language, new effects on arbitrary images can
be created and constructed for inexpensive storage on
Artcard and subsequent distribution to camera owners.
Further, on one surface of the card can be provided an
example illustrating the effect that a particular VARK
script, stored on the other surface of the card, will have
on an arbitrary captured image.

By utilizing such a system, camera technology can be 
distributed without a great fear of obsolescence in that, 
provided a VARK interpreter is incorporated in the camera
device, a device independent scenario is provided whereby 
the underlying technology can be completely varied over 
time. Further, the VARK scripts can be updated as new 
filters are created and distributed in an inexpensive 
manner, such as via simple cards for card reading. 

The Artcard 9 is a piece of thin white plastic with the
same format as a credit card (86mm long by 54mm wide). The
Artcard is printed on both sides using a high resolution ink
jet printer. The ink jet printer technology is assumed to
be the same as that used in the Artcam, with 1600 dpi
(63dpmm) resolution. A major feature of the Artcard 9 is
low manufacturing cost. Artcards can be manufactured at
high speeds as a wide web of plastic film. The plastic web
is coated on both sides with a hydrophilic dye fixing layer.
The web is printed simultaneously on both sides using a
'pagewidth' colour ink jet printer. The web is then slit
and punched into individual cards. On one face of the card
is printed a human readable representation of the effect the
Artcard 9 will have on the sensed image. This can be simply
a standard image which has been processed using the Vark
script stored on the back face of the card.

On the back face of the card is printed an array of
dots which can be decoded into the Vark script that defines
the image processing sequence. The print area is 80mm x
50mm, giving a total of 15,876,000 dots. This array of dots
could represent at least 1.89 Mbytes of data. To achieve
high reliability, extensive error detection and correction
is incorporated in the array of dots. This allows a
substantial portion of the card to be defaced, worn,
creased, or dirty with no effect on data integrity. The
data coding used is Reed-Solomon coding, with half of the
data devoted to error correction. This allows the storage
of 967 Kbytes of error corrected data on each Artcard 9.
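
The capacity figures in this section follow from simple arithmetic, which the Python sketch below reproduces. It assumes one data bit per printed dot and uses the rounded 63 dots per mm quoted for the printer; the function and variable names are illustrative only and do not appear in the specification.

```python
DOTS_PER_MM = 63        # 1600 dpi, rounded as quoted in the text
AREA_MM = (80, 50)      # printed data area, length x width


def artcard_capacity():
    """Return (total_dots, raw_bytes, usable_bytes) for one Artcard.

    Assumes one bit per dot and that exactly half of the raw data is
    given over to Reed-Solomon redundancy.
    """
    dots_x = AREA_MM[0] * DOTS_PER_MM
    dots_y = AREA_MM[1] * DOTS_PER_MM
    total_dots = dots_x * dots_y
    raw_bytes = total_dots // 8
    usable_bytes = raw_bytes // 2
    return total_dots, raw_bytes, usable_bytes


total_dots, raw_bytes, usable_bytes = artcard_capacity()
print(f"dots:   {total_dots:,}")
print(f"raw:    {raw_bytes / 2**20:.2f} Mbytes")
print(f"usable: {usable_bytes / 2**10:.0f} Kbytes")
```

Halving the raw capacity gives roughly 969 Kbytes (taking 1 Kbyte as 1024 bytes), marginally above the 967 Kbytes stated. The small difference is presumably consumed by overhead that this passage does not itemise, which is an assumption here.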
Linear image sensor 34 

The Artcard linear sensor 34 converts the 
aforementioned Artcard. data image to electrical signals. As 
with the area image sensor 2, 4, the linear image sensor can 
be fabricated using either CCD or APS CMOS technology. The 
active length of the image sensor 34 is 50mm, equal to the 
width of the data array on the Artcard 9. To satisfy 
Nyquist's sampling theorem, the resolution of the linear 
image sensor 34 must be at least twice the highest spatial
frequency of the Artcard optical image reaching the image
sensor. In practice, data detection is easier if the image
sensor resolution is substantially above this. A resolution
of 4800 dpi (189 dpmm) is chosen, giving a total of 9,450
pixels. This resolution requires a pixel sensor pitch of
5.3µm. This can readily be achieved by using four staggered
rows of 20µm pixel sensors.
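
The sensor figures can be checked with a short calculation. This is a sketch only: it assumes the finest pattern on the card is a run of alternating dots (one cycle per two dots), which the specification does not state, and the names are illustrative.

```python
# Sketch of the linear sensor arithmetic (names and the alternating-dot
# assumption are illustrative, not taken from the specification).
CARD_DPMM = 63              # Artcard dot density
SENSOR_DPMM = 189           # chosen sensor resolution (4800 dpi)
ACTIVE_LENGTH_MM = 50       # active length of the sensor
STAGGERED_ROWS = 4          # rows of pixel sensors

# Assumption: highest spatial frequency is one cycle per two dots.
highest_cycles_per_mm = CARD_DPMM / 2
nyquist_min_dpmm = 2 * highest_cycles_per_mm

samples_per_dot = SENSOR_DPMM / CARD_DPMM
pixel_count = SENSOR_DPMM * ACTIVE_LENGTH_MM
pitch_um = 1000 / SENSOR_DPMM                    # effective pitch in micrometres
max_pixel_width_um = pitch_um * STAGGERED_ROWS   # room per pixel in each staggered row

print(pixel_count, round(pitch_um, 1), samples_per_dot, round(max_pixel_width_um, 1))
```

With four staggered rows, each row only needs one pixel every four pitches, which leaves a little over 21µm per pixel and is consistent with the 20µm pixel sensors mentioned above.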

The linear image sensor is mounted in a special package 
which includes a LED 65 to illuminate the Artcard 9 via a 
light-pipe (not shown).

The Artcard reader light-pipe can be a moulded light-pipe
which has several functions:

1. It diffuses the light from the LED over the width
of the card using total internal reflection facets.

2. It focuses the light onto a 16µm wide strip of the
Artcard 9 using an integrated cylindrical lens. 

3. It focuses light reflected from the Artcard onto 
the linear image sensor pixels using a moulded array of 
microlenses. 

Artcard reader motor 37 

The Artcard reader motor propels the Artcard past the
linear image sensor 34 at a relatively constant rate. As it
may not be cost effective to include extreme precision
mechanical components in the Artcard reader, the motor 37 is
a standard miniature motor geared down to an appropriate
speed to drive a pair of rollers which move the Artcard 9.
The speed variations, rumble, and other vibrations will
affect the raw image data, so circuitry within the ACP 31
includes extensive compensation for these effects to
reliably read the Artcard data.

The motor 37 is driven in reverse when the Artcard is 
to be ejected. 
Artcard motor driver 61 

The Artcard motor driver 61 is a small circuit which 
amplifies the digital motor control signals from the ACP 31
to levels suitable for driving the motor 37.
Card insertion sensor 49

The card insertion sensor 49 is an optical sensor which
detects the presence of a card as it is being inserted in
the card reader 34. Upon a signal from this sensor 49, the
ACP 31 initiates the card reading process, including the
activation of the Artcard reader motor 37.
Card eject button 13 

A card eject button 13 (Fig. 1) is used by the user to 
eject the current Artcard, so that another Artcard can be 
inserted. The ACP 31 detects the pressing of the button,
and reverses the Artcard reader motor 37 to eject the card.
Card status indicator 66 

A card status indicator 66 is provided to signal the
user as to the status of the Artcard reading process. This
can be a standard bi-colour (red/green) LED. When the card
is successfully read, and data integrity has been verified, 
the LED lights up green continually. If the card is faulty, 
then the LED lights up red. 

If the camera is powered from a 1.5V instead of a 3V
battery, then the power supply voltage is less than the
forward voltage drop of the green LED, and the LED will not
light. In this case, red LEDs can be used, or the LED can
be powered from a voltage pump which also drives other
circuits in the Artcam which require higher voltage.
64 Mbit DRAM 33 

To perform the wide variety of image processing 
effects, the camera utilises 8 Mbytes of memory 33. This 
can be provided by a single 64 Mbit memory chip. Of course,
with changing memory technology, increased DRAM storage sizes
may be substituted.

High speed access to the memory chip is required. This 
can be achieved by using a Rambus DRAM (burst access rate of 
500 Mbytes per second) or chips using the new open standards 
such as double data rate (DDR) SDRAM or Synclink DRAM. 
Camera authentication chip

The camera authentication chip 54 is identical to the
print roll authentication chip 53, except that it has
different information stored in it. The camera
authentication chip 54 has three main purposes:

1. To provide a secure means of comparing
authentication codes with the print roll authentication 
chip; 

2. To provide storage for manufacturing information, 
such as the serial number of the camera; 

3. To provide a small amount of non-volatile memory 
for storage of user information. 

Displays 

The Artcam includes an optional colour display 5 and 
small status display 15. Lowest cost consumer cameras may
include a colour image display, such as a small TFT LCD 5
similar to those found on some digital cameras and
camcorders. The colour display 5 is a major cost element of
these versions of Artcam, and the display 5 plus back light
are a major power consumption drain. 
Status display 15 

The status display 15 is a small passive segment based 
LCD, similar to those currently provided on silver halide 
and digital cameras. Its main function is to show the 
number of prints remaining in the print roll 42 and icons
for various standard camera features, such as flash and
battery status. 
Colour display 5 

The colour display 5 is a full motion image display 
which operates as a viewfinder, as a verification of the 
image to be printed, and as a user interface display. The 
cost of the display 5 is approximately proportional to its 
area, so large displays (say 4" diagonal) will be
restricted to expensive versions of the Artcam unit.
Smaller displays, such as colour camcorder viewfinder TFTs
at around may be effective for mid-range Artcams.

Zoom lens (not shown)

The Artcam can include a zoom lens. This can be a
standard electronically controlled zoom lens, identical to
one which would be used on a standard electronic camera, and
similar to pocket camera zoom lenses. A preferred version of
the Artcam unit may include standard interchangeable 35mm 
SLR lenses. 
Autofocus motor 39 

The autofocus motor 39 changes the focus of the zoom
lens. The motor is a miniature motor geared down to an 
appropriate speed to drive the autofocus mechanism. 
Autofocus motor driver 63 

The autofocus motor driver 63 is a small circuit which
amplifies the digital motor control signals from the ACP 31
to levels suitable for driving the motor 39. 
Zoom motor 38 

The zoom motor 38 moves the zoom front lenses in and 
out. The motor is a miniature motor geared down to an 
appropriate speed to drive the zoom mechanism. 
Zoom motor driver 62 

The zoom motor driver 62 is a small circuit which 
amplifies the digital motor control signals from the ACP 31
to levels suitable for driving the motor. 
Communications 

The ACP 31 contains a universal serial bus (USB)
interface 52 for communication with personal computers. Not
all Artcam models are intended to include the USB connector,
as an added means of differentiating low end Artcams from
up-market models. However, the silicon area required for a
USB circuit 52 is small, so the interface can be included in
the standard ACP.
Optional Keyboard 57 

The Artcam unit may include an optional miniature 
keyboard 57 for customising text specified by the Artcard. 
Any text appearing in an Artcard image may be editable, even 
if it is in a fancy metallic 3D font. The miniature 
keyboard includes a single line alphanumeric LCD to display
the original text and edited text. The keyboard may be a
standard accessory.

The ACP 31 contains a serial communications circuit for 
transferring data to and from the miniature keyboard. 
Power Supply 

The Artcam unit uses a single battery 48. Depending
upon the Artcam options, this is either a 3V Lithium cell,
or a 1.5V AA or AAA alkaline cell.
Power Management Unit 51 

Power consumption is an important design constraint in
the Artcam. It is desirable that either standard camera
batteries (such as 3V lithium batteries) or standard AA or AAA
alkaline cells can be used. While the electronic
complexity of the Artcam unit is dramatically higher than
35mm photographic cameras, the power consumption need not be
commensurately higher. Power in the Artcam can be carefully
managed with all units being turned off when not in use.

The most significant current drains are the ACP 31, the 
area image sensors 2, 4, the printer 44, various motors, the
flash unit 45, and the optional colour display 5 (if
included). Dealing with each part separately:

1. ACP: If fabricated using 0.25µm CMOS, and running
on 1.5V, the ACP power consumption can be quite low. Clocks
to various parts of the ACP chip can be turned off when not
in use, virtually eliminating standby current consumption.
The ACP will only be fully used for approximately 4 seconds for
each photograph printed.

2. Area image sensor: power is only supplied to the 
area image sensor when the user has their finger on the 
button. 

3. The printer: power is only supplied to the printer
when actually printing. This is for around 2 seconds for
each photograph. Even so, suitably lower power consumption
printing should be used.

4. The motors required in the Artcam are all low
power miniature motors, and are typically only activated for
a few seconds per photo.

5. The flash unit 45 is only used for some
photographs. Its power consumption can readily be provided
by a 3V lithium battery for a reasonable battery life.

6. The optional colour display 5 is a major current
drain for two reasons: it must be on for the whole time that
the camera is in use, and a backlight will be required if a
liquid crystal display is used. Cameras which incorporate a
colour display will require a larger battery to achieve
acceptable battery life.
Flash unit 45 

The flash unit 45 can be a standard miniature
electronic flash for consumer cameras.

Artcam Central Processor 

Turning now to Fig. 3, there is illustrated the Artcam
central processor 31 in more detail. The ACP 31 can take
many different forms, depending on the technologies utilised.
One form of ACP 31 will now be described and includes the
following components:

Image Address Interface 93 

Images are manipulated within the Artcam in a variety
of ways. Some methods of manipulation require random access
to pixels within an image, while others require access to
pixels in a specific logical order. The Image Address
Interface provides an interface between a client and the
cached DRAM, allowing specific known processing orders to be
appropriately cached.
The DRAM interface 81 includes 128 cache lines, each
32 bytes wide (32 bytes being the standard Rambus data
transfer unit).

The total memory on chip for caches is therefore 4096
bytes (128 x 32 bytes). The break up of cache assignment is:

-16 to cache the CPU's program (so programs can run at
the same time as control ACP processes)

-16 to cache CPU program's data

-96 floating. These can be assigned to ALUs for
particular functions, or assigned to CPU program or data as
desired.

The 128 cache lines are divided into 8 groups of 16 for 
separate addressing in a given cycle, with appropriate 
multiplexing.
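
The cache budget described above can be summarised in a few lines. This is a sketch under stated assumptions: the dictionary keys and the `group_of` helper are hypothetical, and mapping consecutive line numbers to the same group is only one possible arrangement that the specification does not confirm.

```python
# Sketch of the on-chip cache budget (names are hypothetical).
CACHE_LINES = 128
LINE_BYTES = 32             # standard Rambus data transfer unit
GROUPS = 8                  # groups that can be addressed separately in a cycle

assignment = {"cpu_program": 16, "cpu_data": 16, "floating": 96}

total_bytes = CACHE_LINES * LINE_BYTES
lines_per_group = CACHE_LINES // GROUPS


def group_of(line: int) -> int:
    """Return the group index of a cache line.

    Assumption: consecutive line numbers share a group.
    """
    if not 0 <= line < CACHE_LINES:
        raise ValueError("cache line out of range")
    return line // lines_per_group


print(total_bytes, sum(assignment.values()), lines_per_group)
```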

As stated previously, the image address interface is 
responsible for interfacing between other client portions of 
the ACP chip and the RAMBUS DRAM. In effect, each module 
within the image address interface (IAI) 93 is an address
generator. 

There are basically 3 logical types of images 
manipulated by the ACP. They are: 

-CCD Image, which is the Input Image captured from the
CCD.

-Internal Image format - the Image format utilised
internally by the Artcam device.

-Print Image - the Output Image format printed by the
Artcam.

These images are typically different in colour space and
resolution, and the output & input colour spaces can vary
from camera to camera. For example, a CCD image on a low-end
camera may be a different resolution, or have different
colour characteristics from that used in a high-end camera.
However all internal image formats are the same format in
terms of colour space across all cameras.

In addition, the 3 image types can vary with respect to 
which direction is 'up'. The physical orientation of the
camera causes the notion of a portrait or landscape image, 
and this must be maintained throughout processing. For this 
reason, the internal image is always oriented correctly, and 
rotation is performed on images obtained from the CCD and 
during the print operation. 
CCD Image Organisation 

Although many different CCD image sensors could be
utilised, it will be assumed that the CCD itself is a 750 x
500 image sensor, yielding 375,000 bytes (8 bits per pixel),
with each 2x2 pixel block having the configuration
depicted in Fig. 4.

A CCD Image as stored in DRAM has consecutive pixels 
from a given line contiguous in memory. Each line is stored 
one after the other. The image sensor interface (ISI) 83 is 
responsible for taking data from the CCD and storing it in
the DRAM correctly oriented. Thus a CCD image with rotation
0 degrees has its first line G, R, G, R, G, R... and its
second line as B, G, B, G, B, G... If the CCD image should be
portrait, rotated 90 degrees, the first line will be R, G,
R, G, R, G... and the second line G, B, G, B, G, B... etc.
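
The line patterns quoted above can be reproduced from a single 2x2 mosaic tile. The sketch below is illustrative only: Fig. 4 is not reproduced here, so the tile is inferred from the quoted line patterns, and the counter-clockwise rotation direction is an assumption chosen because it yields the portrait patterns given in the text.

```python
# Sketch: colour filter order per stored line for a 2x2 mosaic tile.
# The tile and the rotation direction are assumptions inferred from the text.
BAYER_TILE = [["G", "R"],
              ["B", "G"]]


def rotate_ccw(tile):
    """Rotate a square tile 90 degrees counter-clockwise."""
    n = len(tile)
    return [[tile[c][n - 1 - r] for c in range(n)] for r in range(n)]


def line_pattern(tile, row, count):
    """Return the filter colours of the first `count` pixels on line `row`."""
    tile_row = tile[row % len(tile)]
    return [tile_row[x % len(tile_row)] for x in range(count)]


landscape = BAYER_TILE
portrait = rotate_ccw(BAYER_TILE)
image_bytes = 750 * 500        # one byte per pixel for the assumed sensor

print(line_pattern(landscape, 0, 6), line_pattern(landscape, 1, 6))
print(line_pattern(portrait, 0, 6), line_pattern(portrait, 1, 6))
```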

Pixels are stored in an interleaved fashion since all
colour components are required in order to convert to the 
internal image format. 

It should be noted that the ACP 31 makes no assumptions
about the CCD pixel format, since actual CCDs for imaging 
may vary from Artcam to Artcam, and over time. All 
processing that takes place via the hardware is controlled 
by microcode in an attempt to extend the usefulness of the 
ACP 31. 

Internal Image Organisation 

Internal images typically consist of a number of
channels. Vark images can include, but are not limited to:

Lab
Labα
Labp
αp
L

L, a and b correspond to components of the Lab colour
space, α is a matte channel (used for composing), and p is a
bump-map channel (used during brushing & illuminating).

The Vark Accelerator 79 functions require images to be 
organised in a planar configuration. Thus a Lab image would
be stored as 3 separate (probably contiguous) blocks of
memory:

one block for the L channel, 
one block for the a channel, and 
one block for the b channel 

Within each channel block, pixels are stored 
contiguously for a given row (plus some optional padding
bytes), and rows are stored one after the other. 

Turning to Fig. 5, there is illustrated an example form
of storage of a logical image 100. The logical image 100 is
stored in a planar fashion having L 101, a 102 and b 103
colour components stored one after another. Alternatively,
the logical image 100 can be stored in a compressed format
having an uncompressed L component 101 and compressed a and
b components 105, 106.

Turning to Fig. 6, the pixels for line n 110 are
stored together before the pixels for line n + 1
(111), with the image being stored in contiguous memory
within a single channel.

In the 8MB-memory model, the final Print Image, after
all processing is finished, needs to be compressed in the
chrominance channels. Compression of chrominance channels is
4:1, causing an overall compression of 12:6, or 2:1.

Other than the final Print Image, images in the Artcam
are typically not compressed. Because of memory constraints,
software may choose to compress the final Print Image in the
chrominance channels by scaling each of these channels by
2:1. If this has been done, the PRINT Vark function call
utilised to print an image must be told to treat the
specified chrominance channels as compressed. The PRINT
function is the only function that knows how to deal with 
compressed chrominance, and even so, it only deals with a 
fixed 2:1 compression ratio. 
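
The 12:6 figure can be seen by counting samples per channel. The following sketch assumes one byte per sample and takes the 2:1 scaling as applying in each dimension, which gives the 4:1 reduction per chrominance channel mentioned above. The 1500 x 1000 size is used purely as an illustration and the function name is hypothetical.

```python
# Sketch: size of a planar Lab image with optional chrominance scaling.
# Assumes one byte per sample; the function name is hypothetical.
def lab_image_bytes(width: int, height: int, chroma_scale: int = 1) -> int:
    """Bytes for L at full size plus a and b scaled by `chroma_scale` per axis."""
    luma = width * height
    chroma = (width // chroma_scale) * (height // chroma_scale)
    return luma + 2 * chroma


full = lab_image_bytes(1500, 1000)                    # uncompressed Lab
compressed = lab_image_bytes(1500, 1000, chroma_scale=2)

print(full, compressed, full / compressed)
```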

Although it is possible to compress an image and then 
operate on the compressed image to create the final print 
image, it is not recommended due to a loss in resolution. In
addition, an image should only be compressed once - as the
final stage before printout. While one compression is 
virtually undetectable, multiple compressions may cause 
substantial image degradation. 
Clip image Organisation

Clip images stored on Artcards have no explicit support 
by the ACP 31. Software is responsible for taking any images 
from the current Artcard and organising the data into a form 
known by the ACP. If images are stored compressed on an
Artcard, software is responsible for decompressing them, as
there is no specific hardware support for decompression of
Artcard images.
Image Pyramid Organisation 

During brushing, tiling, and warping processes utilised
to manipulate an image it is necessary to compute the
average colour of a particular area in an image. Rather than
calculate the value for each area given, these functions
make use of an image pyramid. As illustrated in Fig. 7, an
image pyramid is effectively a multi-resolution pixel-map.
The original image 115 is a 1:1 representation. Low-pass
filtering and sub-sampling by 2:1 in each dimension produces
an image 1/4 the original size 116. This process continues
until the entire image is represented by a single pixel. An
image pyramid is constructed from an original internal
format image, and consumes 1/3 of the size taken up by the
original image (1/4 + 1/16 + 1/64 + ...). For an original
image of 1500 x 1000 the corresponding image pyramid is
approximately 1/2 MB. An image pyramid is constructed by a
specific Vark function, and is used as a parameter to other
Vark functions.
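
The one-third figure can be checked by summing the level sizes directly. This is a minimal sketch for a single channel at one byte per pixel; rounding odd dimensions up when halving is an assumption, and the function name is hypothetical rather than a Vark function.

```python
# Sketch: level sizes of an image pyramid for one channel.
# Rounding odd dimensions up is an assumption; names are hypothetical.
def pyramid_levels(width: int, height: int):
    """Return (width, height) of each reduced level down to a single pixel."""
    levels = []
    while width > 1 or height > 1:
        width = max(1, (width + 1) // 2)
        height = max(1, (height + 1) // 2)
        levels.append((width, height))
    return levels


levels = pyramid_levels(1500, 1000)
original_bytes = 1500 * 1000
pyramid_bytes = sum(w * h for w, h in levels)

print(len(levels), pyramid_bytes, pyramid_bytes / original_bytes)
```

For the 1500 x 1000 example the total comes to about 0.5 MB per channel, in line with the figure quoted above.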

Print Image Organisation 

The entire processed image is required at the same time 
in order to print it. However the Print Image output can
comprise a CMY dithered image and is only a transient image
format, used within the Print Image functionality. However,
it should be noted that colour conversion will need to take
place from the internal colour space to the print colour
space. In addition, this colour conversion can be tuned to 
be different for different print rolls in the camera with 
different ink characteristics, e.g. sepia output can be
accomplished by using a specific sepia toning Artcard, or by
using a sepia tone print-roll (so all Artcards will work in
sepia tone).

Colour Spaces

There are 3 colour spaces used in the Artcam, 
corresponding to the different image types: 

CCD Image has a unique CCD colour space

Internal Image has the internal colour space 

Print Image has the printer colour space 

The ACP has no direct knowledge of specific colour
spaces. Instead, it relies on client colour space conversion
tables to convert between CCD, internal, and printer colour
spaces:

CCD        RGB
Internal   Lab
Printer    CMY

Removing the colour space conversion from the ACP 31 
allows: 

-Different CCDs to be used in different cameras

-Different inks (in different print rolls over time) to
be used in the same camera

-Separation of CCD selection from ACP design

-A well defined internal colour space for accurate
colour processing

The overall process for creating an output image is as
illustrated in Fig. 8. The process 120 includes bringing in a
CCD image 121 in a CCD colour space, the conversion of the
CCD image to an internal image 122 in an internal colour
space, the continual processing 123 of the internal image to
produce a final internal image, followed by the creation of
a print image 124 for printing out in the printer's colour
space. With each conversion 126, 127 colour tables are
required for colour mapping the images from one colour space
to another. These colour tables can be provided in the
Artcam ROM or in the particular print ROM.
Image access 

Access to images is via special image address
generators, defined logically below. The Image Address
Interface 93 contains a number of these address generation
state machines (AGSM).

Each AGSM has a set of registers for defining image
characteristics:

Register Name   # bits   Description
ImageStart      32       The address in memory where the image starts
ImageHeight     12       The number of lines in the image
ImageWidth      12       The number of pixels in a line
RowOffset       12       The number of bytes from one row to the next.
                         Equals ImageWidth + any padding
StartRow        12       Which row to start at in the image
EndRow          12       The last row+1 to be returned or written to
                         within the image
StartPixel      12       Left border of the section of the image
EndPixel        12       The last pixel+1 to be returned or written to
                         along a given row
Loop            1        Keep looping the data

Random Access to pixels 

Images are rarely required to be accessed in completely
random (x, y) fashion, although it is straightforward enough 
to access a given pixel within a channel by the following 
addressing algorithm: 

Address for pixel (X, Y) = ImageStart + (RowOffset * Y) + X.

This only gives the address of a single colour
channel's component, and 3 such operations would be required 
to access all 3 colour components of a single pixel. 
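
The addressing rule above can be sketched as follows. The class and function names are hypothetical, the field names mirror the AGSM registers, and the example values (a 750 pixel wide channel padded to a 768 byte row at address 0x1000) are assumptions chosen for illustration.

```python
# Sketch of random pixel addressing within one planar channel.
# Class, function names, and example values are hypothetical.
from dataclasses import dataclass


@dataclass
class ChannelLayout:
    image_start: int     # ImageStart: address of the first pixel
    image_width: int     # ImageWidth: pixels per line
    image_height: int    # ImageHeight: number of lines
    row_offset: int      # RowOffset: bytes per row, i.e. width plus padding

    def address(self, x: int, y: int) -> int:
        """Address of pixel (x, y): ImageStart + RowOffset * Y + X."""
        if not (0 <= x < self.image_width and 0 <= y < self.image_height):
            raise IndexError("pixel outside image")
        return self.image_start + self.row_offset * y + x


def lab_addresses(l_chan, a_chan, b_chan, x, y):
    """Three lookups are needed, one per colour channel."""
    return (l_chan.address(x, y), a_chan.address(x, y), b_chan.address(x, y))


channel = ChannelLayout(image_start=0x1000, image_width=750,
                        image_height=500, row_offset=768)
print(channel.address(3, 2))
```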

Image Iterators = Sequential Access to pixels 

The primary image pixel access method for software and 
hardware algorithms is via Image Iterators located within 
the Image Address Interface 93. Image iterators perform all 
of the addressing and caching of the pixels within an image
channel and either read or write pixels for their client.
Read Iterators read pixels in a specific order for their
clients, and Write Iterators write pixels in a specific
order for their clients. 

Turning to Fig. 9, there is illustrated the operation of
the Image Iterators of the embodiment. Each iterator, e.g.
130, is interconnected to the DRAM 33 via DRAM cache 131.
The Read Iterators, e.g. 130, and Write Iterators, e.g. 132,
act as an intermediary between a client, e.g. 133, 134,
requesting the data and the data stored within the DRAM 33. 
The iterators are responsible for correct ordering of image 
data. 

Turning to Fig. 10, there is illustrated an example
Read Iterator 130 which can comprise a state machine 136
interconnected to and controlling a FIFO 137. The state
machine 136 is responsible for sending the requests to the
DRAM cache and keeping the FIFO 137 full. Further, the
state machine 136 receives read requests from clients and
clocks out FIFO data from the FIFO queue 137 in response to
those read requests.

The Read Image Iterators 130 can be thought of as a 
FIFO that contains the entire image in a specific order (of 
course they are not implemented as such). Every time a pixel
is read from the FIFO 137, the next pixel from the image is
read into the end of the FIFO.
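
The behaviour described above, a small FIFO that is topped up each time a pixel is taken, can be modelled in software. This is a behavioural sketch only: it ignores cache lines and cycle timing, the class and method names are hypothetical, and the default depth of 5 is borrowed from the 5 byte FIFO mentioned later for the Sequential Read Iterator.

```python
# Behavioural sketch of a Read Iterator: a small FIFO kept full from memory.
# Names are hypothetical; caching and timing are deliberately not modelled.
from collections import deque


class ReadIterator:
    def __init__(self, memory, start, count, fifo_depth=5):
        self.memory = memory
        self.next_addr = start
        self.end = start + count
        self.fifo_depth = fifo_depth
        self.fifo = deque()
        self._refill()

    def _refill(self):
        """State machine role: keep the FIFO full while pixels remain."""
        while len(self.fifo) < self.fifo_depth and self.next_addr < self.end:
            self.fifo.append(self.memory[self.next_addr])
            self.next_addr += 1

    def empty(self):
        return not self.fifo

    def read(self):
        """Client read: clock one pixel out, then fetch the next one in."""
        value = self.fifo.popleft()
        self._refill()
        return value


iterator = ReadIterator(bytes(range(20)), start=2, count=10)
pixels = []
while not iterator.empty():
    pixels.append(iterator.read())
print(pixels)
```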

Write Image Iterators can similarly be considered as a 
FIFO that is written to by a process. The process writes 
pixels in a specific order to write out the entire image.

As illustrated in Fig. 11, typically a process 140 will
have its input tied to a Read Iterator 141, and output tied 
to a corresponding Write Iterator 142. 

A variety of Image Iterators exist to cope with the 
most common addressing requirements of image processing 
algorithms. In most cases there is a corresponding Write 
Iterator for each Read Iterator. The different Iterators are 
listed in the following table: 



Read Iterators         Write Iterators
Sequential Read        Sequential Write
Box Read               -
Vertical Strip Read    Vertical Strip Write

In the ACP there are 5 Read Iterators and only 3
Write Iterators.

Although an Iterator is perceived to be an unlimited 
FIFO, in practice there is a small FIFO connected to two or 
more cache lines. The small FIFO is required to allow for 
the fact that more than one Iterator is likely to be in use 
at one time, and only one access can be made to the cache in 
a single cycle. 

All FIFOs belonging to Image Iterators can preferably 
be accessed by software as memory mapped I/O. General 
software algorithms that may not be microcoded can
therefore take advantage of the image access
mechanisms.
Table Access 

It can often be necessary to look up values in a table.
Linear table: set up by software, e.g. 256 values of 1 byte
each.

ALUs write a byte lookup address to one FIFO.
The linear table address generator looks up the value
next cycle (optional multiply by 2 for 16 bit entries) and
puts results (8 or 16 bits) into the output FIFO. For 16
bits the order is always the same (lo/hi or hi/lo). Value is
written to FIFO in cycle N, first 8 bits available from FIFO
at start of N+2 (i.e. skips one cycle).
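
A software view of the linear table lookup might look as follows. This is a sketch rather than the hardware behaviour: the function name is hypothetical, the two-cycle latency is not modelled, and the lo/hi byte order for 16 bit entries is an assumption since the text allows either order.

```python
# Sketch of a linear table lookup with 8 or 16 bit entries.
# Function name is hypothetical; lo/hi byte order is an assumption.
def linear_table_lookup(table: bytes, index: int, entry_bits: int = 8) -> int:
    if entry_bits == 8:
        return table[index]
    if entry_bits == 16:
        offset = index * 2                      # the optional multiply by 2
        lo, hi = table[offset], table[offset + 1]
        return lo | (hi << 8)
    raise ValueError("entry_bits must be 8 or 16")


invert_table = bytes(255 - i for i in range(256))    # 256 values of 1 byte each
wide_table = bytes([0x34, 0x12, 0x78, 0x56])         # two 16 bit entries

print(linear_table_lookup(invert_table, 10))
print(hex(linear_table_lookup(wide_table, 1, entry_bits=16)))
```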
CCD Image Access 
Random Access to pixels 

There is no special address generator for specifying 
fast access to CCD images in DRAM. If a process requires
random access it must directly address DRAM and decode image
pixels itself. 

Sequential Read and Sequential Write Iterators 

The simplest Image Iterator is the Sequential Read 
Iterator. It presents the pixels from a channel one line at 
a time from top to bottom, and within a line, pixels are
presented left to right. The padding bytes are not presented
to the client. It is most useful for algorithms that must
perform some process on each pixel from an image but don't 
care about the order of the pixels being processed. 

The Sequential Read Iterator comprises 2 cache lines 
and a small (5 bytes) FIFO. While 32 pixels are being 
presented from one cache line, the other cache line can be 
loaded from memory. 

Complementing the Sequential Read Iterator is a 
Sequential Write Iterator. Clients write pixels to a FIFO
owned by a Sequential Write Iterator that subsequently writes
out a valid image using appropriate caching and appropriate
padding bytes. The Sequential Write Iterator also comprises
2 cache lines and a small FIFO. 

A process that performs an operation on each pixel of
an image independently would typically use a Sequential Read
Iterator to obtain pixels, and a Sequential Write Iterator
to write the new pixel values to their corresponding
locations within the destination image. It is valid for
the source image and destination image to be the same, since
a given input pixel is not read more than once. 
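
The pairing of a Sequential Read Iterator with a Sequential Write Iterator, including the in-place case, can be sketched with generators. The function names are hypothetical and the model covers pixel ordering and padding only, not the cache lines or FIFOs. The padding value 0xEE is an arbitrary marker used to show that padding bytes are neither presented nor overwritten.

```python
# Sketch: sequential read and write over one channel, skipping padding bytes.
# Function names and the example data are hypothetical.
def sequential_read(memory, image_start, width, height, row_offset):
    """Yield pixels line by line, left to right, without padding bytes."""
    for y in range(height):
        row = image_start + y * row_offset
        for x in range(width):
            yield memory[row + x]


def sequential_write(memory, image_start, width, height, row_offset, pixels):
    """Store pixels in the same order, leaving padding bytes untouched."""
    pixels = iter(pixels)
    for y in range(height):
        row = image_start + y * row_offset
        for x in range(width):
            memory[row + x] = next(pixels)


# A 3 x 2 channel with one padding byte per row (RowOffset = 4).
memory = bytearray([1, 2, 3, 0xEE, 4, 5, 6, 0xEE])
doubled = (min(255, p * 2) for p in sequential_read(memory, 0, 3, 2, 4))
sequential_write(memory, 0, 3, 2, 4, doubled)   # source and destination coincide
print(list(memory))
```

Because the generator is evaluated lazily, each pixel is read just before it is overwritten, which mirrors the observation that a given input pixel is not read more than once.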
Internal Format Image Access 

Further, as in a single cycle 4 bytes can be
transferred from an Iterator's cache into the FIFO, this
allows up to 4 Iterators to do the same thing if cache
accesses are staggered. The net effect is that 4 Iterator
FIFOs can be accessed every clock cycle without the caches
having to support multiple accesses per cycle. The 4 Iterators
may be 3 Read Iterators and one Write Iterator. For example,
as shown in Fig. 12, in a single cycle it is possible to read 3
pixels, 1 from each of 3 Read Iterators 145-147, perform
some processing on them 148, and take the single pixel
output (derived from a previously read 3 pixels) and
transfer it to a Write Iterator 149. The average processing
time for a single pixel in output would thus be 1 cycle.

A variety of Image Iterators exist to cope with the
most common addressing requirements of image processing
algorithms. They are:

Sequential Read (previously discussed)
Sequential Write (previously discussed)
Box Read
Vertical-Strip Read
Vertical-Strip Write
Box Read Iterator 

The Box Read Iterator is used to present pixels in an
order most useful for performing general-purpose filters,
convolves and the like. The Iterator presents pixel values
in a square box around the sequentially read pixels. The box
is limited to being 3, 5, or 7 pixels wide. The client has
the choice of duplicating edge pixels, or having non-image
pixels be a constant value. The client also has the
option of starting the center pixel of
IteratorSpecific1:

The special purpose register IteratorSpecific1 has the
following bit usage:

Bits   Name                  Usage
0      DuplicateEdgePixels   1 = duplicate edge pixels for box
                             region outside image
                             0 = return OutsideImagePixel for
                             box region outside image
1-8    OutsideImagePixel     Constant pixel value to return for
                             pixels outside the actual image
                             area if DuplicateEdgePixels = 0.
9-11   Reserved

is used to determine a sub-sampling in terms of which input
pixels will be used as the center of the box. The usual
value is 1, which means that each pixel is used as the
center of the box. The value "2" would be useful in scaling
an image down by 4:1, as in the case of creating an image
pyramid. Using pixel addresses from the previous diagram,
the box would be centered around pixel 0, then 2, 8, and 10.
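
The centre sequence 0, 2, 8, 10 is what a step of 2 produces on a small image whose pixels are numbered row by row. The diagram referred to is not reproduced here, so the sketch below assumes a 4 x 4 image with row-major numbering; the function name is hypothetical.

```python
# Sketch: which pixels become box centres for a given sub-sampling step.
# The 4 x 4 row-major layout is an assumption; the name is hypothetical.
def box_centres(width: int, height: int, step: int = 1):
    """Return row-major pixel numbers used as box centres."""
    return [y * width + x
            for y in range(0, height, step)
            for x in range(0, width, step)]


print(box_centres(4, 4, step=1))   # every pixel is a centre
print(box_centres(4, 4, step=2))   # one centre per 2 x 2 block, i.e. 4:1
```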

In Fig. 13 there is shown a first example of the box
read iterator output, with Fig. 14 showing a second example.
In Fig. 13, a box region, e.g. 150, is output for a current
input pixel 151, with Fig. 13 illustrating the 3x3 pixel
output case. A first series of pixels 152 illustrates the
box read iterator output for the current pixel 151 when
duplication of edge pixels is set. A second series of
output pixels 153 illustrates the case when duplication of
edge pixels is not set. In this case, a pre-set constant
"outside image" pixel value is output. Fig. 14 illustrates a
similar case for the current pixel 156 having a 3x3 output
grid 155.
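
The two edge-handling modes can be sketched for a small image. This is an illustrative model, not the hardware: the function name and parameters are hypothetical, with `duplicate_edge` and `outside_pixel` standing in for the DuplicateEdgePixels and OutsideImagePixel register fields.

```python
# Sketch of Box Read output for one centre pixel (names are hypothetical).
def box_read(image, cx, cy, size=3, duplicate_edge=True, outside_pixel=0):
    """Return the size x size box of pixel values centred on (cx, cy)."""
    if size not in (3, 5, 7):
        raise ValueError("box is limited to 3, 5, or 7 pixels wide")
    height, width = len(image), len(image[0])
    half = size // 2
    box = []
    for y in range(cy - half, cy + half + 1):
        row = []
        for x in range(cx - half, cx + half + 1):
            if 0 <= x < width and 0 <= y < height:
                row.append(image[y][x])
            elif duplicate_edge:
                # Clamp to the nearest pixel inside the image.
                row.append(image[min(max(y, 0), height - 1)]
                                [min(max(x, 0), width - 1)])
            else:
                row.append(outside_pixel)
        box.append(row)
    return box


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
print(box_read(image, 0, 0, duplicate_edge=True))
print(box_read(image, 0, 0, duplicate_edge=False, outside_pixel=0))
```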

As illustrated in Fig. 15, a process that uses the Box
Read Iterator 160 for input would most likely use the 
Sequential Write Iterator 161 for output since they are in 
sync. A good example is the convolver 162, where N input 
pixels are read to calculate 1 output pixel. 

The Box Read Iterator will require a maximum of 14 (2 x 
7) cache lines and a small (5 bytes) FIFO. While pixels are 
presented from one set of cache lines, the other cache lines 
can be loaded from memory.
Vertical-Strip Read and Write Iterators 

In some instances it is necessary to write an image in
output pixel order, with no knowledge about the direction of
coherence in input pixels in relation to output pixels.
Examples of this are rotation and warping. If it is
necessary to rotate an image 90 degrees, and process the
output pixels horizontally, a complete loss of cache
coherence may result. On the other hand, if it is necessary
to process the output image one cache line's width of pixels
at a time and then advance to the next line (rather than
advance to the next cache-line's worth of pixels on the same
line), we will gain cache coherence for some input image
pixels.

It can also be the case that there is known 'block'
coherence in the input pixels (such as colour coherence), in
which case the read governs the processing order, and the
write, to be synchronised, must follow the same pixel order.

With the vertical strip Iterators, the order of pixels
presented as input (Vertical-Strip Read), or expected for
output (Vertical-Strip Write) is the same and is depicted in
Fig. 16. The order is pixels 0 to 31 (165) from line 0
(166), then pixels 0 to 31 of line 1 (167) etc., for all
lines of the image, thereby making up first strip 169, then
pixels 32 to 63 of line 0, pixels 32 to 63 of line 1 etc.,
making up second strip 170. The final vertical strip may not
be exactly 32 pixels wide. In this case only the actual
pixels in the image are presented as input or expected as
output.
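
The strip ordering described above can be sketched as a small generator. This is an illustrative sketch only: the function name and the default strip width parameter are assumptions made for the example, not part of the described hardware.

```python
def vertical_strip_order(width, height, strip_width=32):
    """Yield (x, y) pixel coordinates in vertical-strip order.

    Pixels 0..31 of every line come first, then pixels 32..63 of
    every line, and so on. The final strip may be narrower.
    """
    for strip_start in range(0, width, strip_width):
        # Clip the last strip to the real image width.
        strip_end = min(strip_start + strip_width, width)
        for y in range(height):
            for x in range(strip_start, strip_end):
                yield (x, y)
```

For a 40 pixel wide image the second strip holds only pixels 32 to 39 of each line, matching the narrower final strip described above.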

Referring to Fig. 17, a process 173 that requires only a 
Vertical-Strip Write Iterator will typically have a way of 
mapping input pixel coordinates given an output pixel 
coordinate. It would access 175 the input image pixels 
according to this mapping, and coherence is determined by 
having sufficient cache lines on the 'random-access' reader
for the input image. 

It is not meaningful to pair this Write Iterator with a
Sequential Read Iterator or a Box Read Iterator, but a
Vertical-Strip Write Iterator does give significant
improvements in performance in certain situations.

Clients read pixels from the FIFO owned by the
Vertical-Strip Read Iterator, which reads images cached
appropriately. Clients write pixels to the FIFO owned by the
Vertical-Strip Write Iterator, which subsequently writes out
a valid image using appropriate caching and appropriate
padding bytes. Each Iterator requires 2 cache lines, and a
small (5 byte) FIFO.
Table I/O Units 

It is often necessary to look up values in a table
(which may also be an image). While Image Iterators only
have a single FIFO, Table I/O Units require 2 FIFOs: an
input FIFO and an output FIFO. Clients pass indexes into the
Input FIFO (17 bits wide) and receive values from the table
via the Output FIFO (16 bits wide).
1 Dimensional Tables 
Direct Lookup 

A direct lookup is a simple indexing into a 1
dimensional lookup table. The value passed in by the client
via the Input FIFO is shifted to the appropriate location
using a Barrel Shifter, ANDed with a mask, and then ORed
with the Base Address to give the final address. The 8 or
16 bit data value at the address is placed into the Output
FIFO. Address generation takes 1 cycle, and transferring
the requested data from the cache to the Output FIFO also
takes 1 cycle (assuming a cache hit).
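
The shift, mask and OR steps of the address generation can be sketched as follows. The function names, the flat byte array standing in for cached DRAM, and the little-endian byte order for 16 bit entries are assumptions made for illustration; the text does not specify them.

```python
def direct_lookup_address(index, shift, mask, base_address):
    """Shift the client index, AND with a mask, OR with the base address."""
    return ((index << shift) & mask) | base_address


def direct_lookup(memory, index, shift, mask, base_address, entry_bytes=1):
    """Return the 8 bit (entry_bytes=1) or 16 bit (entry_bytes=2) entry."""
    address = direct_lookup_address(index, shift, mask, base_address)
    # Byte order is an assumption for this sketch.
    return int.from_bytes(memory[address:address + entry_bytes], "little")
```

With 16 bit entries the index would be shifted left by 1 so that consecutive indexes address consecutive 2 byte slots.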
Interpolate table 

This is the same as a linear table except that 2 values
are returned for a given address. The values returned are
Table[X] and Table[X+1]. If X+1 is invalid, Table[X] is
returned twice. Address generation takes 1 cycle, and
transferring the requested data from the cache to the Output
FIFO takes 2 cycles (assuming a cache hit).
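
A minimal sketch of the two-value return, assuming the table is modelled as a Python list and that "invalid" means past the end of the table (both are assumptions for the example):

```python
def interpolate_lookup(table, x):
    """Return (Table[X], Table[X+1]); repeat Table[X] if X+1 is invalid."""
    first = table[x]
    second = table[x + 1] if x + 1 < len(table) else first
    return first, second
```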
DRAM FIFO

A special case of a 1D table is a DRAM FIFO. It is
often necessary to have a simulated FIFO of a given length
using DRAM and associated caches. With a DRAM FIFO, clients
do not index explicitly into the table, but read and write
to the table as if it were a large FIFO. Two counters keep
track of input and output positions in the simulated FIFO,
and cache to DRAM as needed. When values are taken from the
Output FIFO by the client, the next values are placed into
the FIFO from the cache. When values are placed into the
Input FIFO by the client, they are placed into the cache at
the next position.
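
The two-counter scheme can be sketched in software as a circular buffer. The class name, the extra occupancy count used to tell full from empty, and the exceptions raised are assumptions of this sketch; the text only states that two counters track the input and output positions.

```python
class SimulatedFifo:
    """A fixed-length FIFO laid over a table, with two position counters."""

    def __init__(self, length):
        self.table = [None] * length  # stands in for the DRAM-backed table
        self.write_pos = 0            # input position counter
        self.read_pos = 0             # output position counter
        self.count = 0                # occupancy (assumption of this sketch)

    def put(self, value):
        if self.count == len(self.table):
            raise OverflowError("simulated FIFO is full")
        self.table[self.write_pos] = value
        self.write_pos = (self.write_pos + 1) % len(self.table)
        self.count += 1

    def get(self):
        if self.count == 0:
            raise IndexError("simulated FIFO is empty")
        value = self.table[self.read_pos]
        self.read_pos = (self.read_pos + 1) % len(self.table)
        self.count -= 1
        return value
```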

2 Dimensional Tables 
Direct Lookup 

A 2 dimensional direct lookup is not included at the
moment. All cases of 2D lookups are needed for bi-linear
interpolation.

Bi-Linear Lookup

This kind of lookup is necessary for bi-linear
interpolation. Given an X and Y coordinate in a table, 4
values are returned after lookup. The four values (in order)
are:

Table [X, Y]
Table [X+1, Y]
Table [X, Y+1]
Table [X+1, Y+1]

The order specified allows for the best cache coherence.

3 Dimensional Lookup 
Direct Lookup 

A 3 dimensional direct lookup is not required at the 
moment. All cases of 3D lookups are needed for tri-linear
interpolation. 
Tri-linear lookup 

This kind of lookup is necessary for tri-linear 
interpolation. Given an X, Y, and Z coordinate, 8 values are 
returned in order from the lookup table: 

Table [X, Y, Z]
Table [X+1, Y, Z]
Table [X, Y+1, Z]
Table [X+1, Y+1, Z]
Table [X, Y, Z+1]
Table [X+1, Y, Z+1]
Table [X, Y+1, Z+1]
Table [X+1, Y+1, Z+1]
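
The order of the eight entries listed above has X varying fastest and Z slowest. A short sketch that reproduces that order (the function name is hypothetical):

```python
def trilinear_lookup_order(x, y, z):
    """Return the 8 table coordinates in the order listed above."""
    return [
        (x + dx, y + dy, z + dz)
        for dz in (0, 1)   # Z changes slowest
        for dy in (0, 1)
        for dx in (0, 1)   # X changes fastest
    ]
```

Dropping the Z loop gives the four-entry order used by the bi-linear lookup.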

The 3 values passed in by the client are barrel
shifted, ORed together with the base address, and looked up.
The 8 sets of 1 byte values are returned via the Output
FIFO.
Image Pyramid Access

During brushing, tiling, and warping it is often
necessary to compute the average colour of a particular area
in an image. Rather than calculate the value for each area
given, these functions make use of an image pyramid as
previously illustrated in Fig. 7. An image pyramid is
effectively a multi-resolution pixel-map. The original image
is a 1:1 representation. Low-pass filtering and sub-sampling
by 2:1 in each dimension produces an image 1/4 the original
size. This process continues until the entire image is
represented by a single pixel.
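
A minimal sketch of pyramid construction for a single channel follows. It assumes a 2x2 box average as the low-pass filter and clamps at the edges for odd sizes; the text does not specify the filter, so both choices are assumptions, as are the function names.

```python
def next_pyramid_level(image):
    """Halve a single-channel image (list of rows) by 2x2 averaging."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(0, height, 2):
        y1 = min(y + 1, height - 1)      # clamp on odd heights
        row = []
        for x in range(0, width, 2):
            x1 = min(x + 1, width - 1)   # clamp on odd widths
            total = image[y][x] + image[y][x1] + image[y1][x] + image[y1][x1]
            row.append(total // 4)
        result.append(row)
    return result


def build_pyramid(image):
    """Return all levels, from the 1:1 image down to a single pixel."""
    levels = [image]
    while len(levels[-1]) > 1 or len(levels[-1][0]) > 1:
        levels.append(next_pyramid_level(levels[-1]))
    return levels
```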

To access an image pyramid a list of image level
addresses is required. These are 12 x 32 bit registers, each
of which stores the address of a given level in the pyramid
in the RDRAM memory. The width and height of the original
image (level 0) are also required.

The client specifies a pixel address in terms of 3
components: x, y, and level. On subsequent cycles, 4 pixel
units are returned in a specific order via a FIFO:

The pixel at (INTEGER[scaled x], INTEGER[scaled y], z)
The pixel at (INTEGER[scaled x]+1, INTEGER[scaled y], z)
The pixel at (INTEGER[scaled x], INTEGER[scaled y]+1, z)
The pixel at (INTEGER[scaled x]+1, INTEGER[scaled y]+1, z)

The offset from the start of an image to a given (x, y) 
coordinate is given by: RowBytes * Y + X. 

For a different level of the pyramid, a simple barrel shift
right of the RowBytes value by the level number gives the
RowBytes value for that level. This value needs to be
multiplied by a scaled Y (also barrel shifted) and the
result added to a barrel shifted X value. For example,
if the scaled (X, Y) coordinate was (10.4, 12.7), 4 pixels
would be returned in the order (10, 12), (11, 12), (10, 13)
and (11, 13). When pixels are exactly aligned (no fractional
component), the "+1" pixels are duplicated (to save a read
from DRAM). When a coordinate is outside the valid
range, clients have the choice of edge pixel duplication or
returning of a constant colour value (typically black).
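
The offset calculation and the four-pixel return order can be sketched as below. This is one reading of the text: x and y are taken as level 0 integer coordinates that are shifted right by the level number, and alignment is tested separately on each axis. The function names are hypothetical.

```python
import math


def pyramid_offset(row_bytes, x, y, level):
    """Offset of (x, y) within a given level: shifted RowBytes * Y + X."""
    return (row_bytes >> level) * (y >> level) + (x >> level)


def four_pixel_coords(scaled_x, scaled_y):
    """Return the 4 pixel coordinates in the order described above."""
    x0, y0 = math.floor(scaled_x), math.floor(scaled_y)
    # With no fractional component the "+1" pixel is a duplicate.
    x1 = x0 if scaled_x == x0 else x0 + 1
    y1 = y0 if scaled_y == y0 else y0 + 1
    return [(x0, y0), (x1, y0), (x0, y1), (x1, y1)]
```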
DRAM Interface 81 

The DRAM used by the Artcam is a 64Mbit (8MB) RAMBUS
DRAM operating at 500MHz. Using RAMBUS DRAM implies that
applications should minimize the number of random memory
accesses to avoid degraded memory access performance.

To take advantage of the 4 internal banks of memory in
a single DRAM chip, every 32 bytes should be in a different
bank with address wiring accordingly. The 4 bank internal
arrangement of RAMBUS DRAM can also be used to advantage if
necessary as long as this does not create unnecessary
algorithmic complexity.
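
One way to picture the 32 byte interleave is the address-to-bank mapping below. The exact address wiring is not given in the text, so the simple modulo mapping and the names used here are assumptions for illustration.

```python
BANK_COUNT = 4     # internal banks in a single DRAM chip
BLOCK_BYTES = 32   # consecutive 32 byte blocks fall in different banks


def bank_of(address):
    """Return the bank holding a byte address under 32 byte interleaving."""
    return (address // BLOCK_BYTES) % BANK_COUNT
```

Under this mapping a sequential read of 128 bytes touches each of the 4 banks exactly once.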

Bank accesses can have their latencies overlapped, so
while data is being transferred from one bank, another can
be setting up for the transfer. Interleaved in this way,
assuming a worst case of a DRAM-internal cache miss every
access, 4 sets of 32 byte reads can be accomplished in
320ns.
Cache Lines 

In order to reduce effective memory latency, the ACP
contains 128 cache lines, each 32 bytes wide. The total
memory on chip for caches is therefore 4096 bytes (128 x 32
bytes). The breakup of cache assignment is:

16 to cache the CPU's program (so programs can run at
the same time as control ACP processes)
16 to cache CPU program's data
96 floating. These can be assigned to ALUs for
particular functions, or assigned to CPU program or
data as desired.

The 128 cache lines are divided into 8 groups of 16 for
separate addressing in a given cycle, with appropriate
multiplexing.

Memory Organization 

Memory in an Artcam consists of a contiguous 32MB area
(of which 8MB is actually used). In addition to the real
memory, there are some other non-contiguous address spaces
which are effectively 'virtual' memory areas. These are ACP
registers, used for memory mapped I/O. The memory
organization for an Artcam with 8MB of RDRAM is shown in the
following table:



Contents                              Size
Program scratch RAM                   0.50 MB
Artcard data                          1.00 MB
Photo Image, captured from CCD        0.50 MB
Print Image (compressed)              2.25 MB
1 Channel of expanded Photo Image     1.50 MB
1 Image Pyramid of single channel     1.00 MB
Intermediate Image Processing         1.25 MB
TOTAL                                 8 MB



channel). To accommodate other objects in the 8MB model, the
Print Image needs to be compressed. If the chrominance
channels are compressed by 4:1 they require only 0.375MB
each. The memory model described here assumes a single 8
MB RDRAM. Other models of the Artcam may have more memory,
and thus not require compression of the Print Image. In
addition, with more memory a larger part of the final image
can be worked on at once, potentially giving a speed
improvement. Ejecting or inserting an Artcard
invalidates the 5.5MB area holding the Print Image, 1
channel of expanded photo image, and the image pyramid. This
space may be safely used by the Artcard Interface for
decoding the Artcard data.
VLIW Vector Processor 74 

In order to reduce the complexity of the ACP design,
the ACP contains a VLIW (Very Long Instruction Word) Vector
Processor 74. The processor is essentially a set of I/O
Units 177 connected to a set of ALUs 178 via FIFOs 179 as
illustrated in Fig. 18. The Cache Interface 176 is
described separately below. It provides the interface to
DRAM 33 and is the primary input and output mechanism for
the VLIW 74.

The I/O Units Block 177 consists of a number of types
of address generators, each linked to a specific FIFO and
the Cache Interface. The address generators are able to read
and write data (specifically images in a variety of formats)
as well as tables and simulated FIFOs in DRAM. They are
customizable under software control, but cannot be
microcoded.

The FIFOs 179 connecting the I/O Units 177 to the ALUs
178 are tied to specific I/O Units and specific ALUs. In
summary there are:

5 x 8 bit output FIFOs (from I/O unit to ALU)
3 x 8 bit input FIFOs (from ALU to I/O unit)
4 x 16 bit output FIFOs (from I/O unit to ALU)
4 x 17 bit input FIFOs (from ALU to I/O unit)

External processes have the ability to write to 1 of
these 8 bit input FIFOs, and to read from 1 of the 8 bit
output FIFOs. This allows other parts of the chip to provide
input (for example the Image Sensor Interface can provide
the pixels from the CCD) or to process the output (for
example the Print Head Interface is able to take pixels in
order to print them). These two FIFOs are known as the VLIW
Input FIFO 180 and VLIW Output FIFO 181 respectively.

The ALUs Block 178 consists of a number of types of
microprogrammable ALUs coupled together. Each of the ALUs
contains a number of registers, some microcode RAM, and
connections to the outside world. The connections are
inputs, outputs, or both inputs and outputs. Specific ALUs
connect to the FIFOs 179 and via them to the Address Units.

The Address and Data Buses connection 182 allows the
CPU to read and write registers in the VLIW Vector
Processor, as well as each ALU's microcode RAM. Rather than
have the microcode in ROM inside the VLIW Vector Processor,
the microcode is in RAM, with the program CPU responsible
for loading it up. For the same space on chip, this tradeoff
reduces the maximum size of any one function to the size of
the RAM, but allows an unlimited number of functions to be
written in microcode. Functions implemented using ALU
microcode include Vark acceleration, Artcard reading, and
Printing functions.

The VLIW Vector Processor scheme has several advantages
for the case of the ACP:

Hardware design complexity is reduced
Hardware risk is reduced due to reduction in complexity
Hardware design time does not depend on all Vark
functionality being implemented in dedicated silicon
Space on chip is reduced overall (due to large number
of processes able to be implemented as microcode)
Functionality can be added to Vark (via microcode) with
no impact on hardware design time
ALUs Block 178

The ALUs Block 178 consists of a number of types of
microprogrammable ALUs coupled together. Each of the ALUs
contains a number of registers, some microcode RAM, and
connections to the outside world. The connections are
inputs, outputs, or both inputs and outputs. Specific ALUs
connect to the FIFOs and via them to the I/O Units.

The different ALU types are:

Memory Interface Units: connected to the FIFOs

• Read Unit - attached to a FIFO corresponding to a Read
Iterator

• Write Unit - attached to a FIFO corresponding to a Write
Iterator

• ReadWrite Unit - attached to 2 FIFOs corresponding to a
Table I/O Unit

Processing Units:

• Adder ALU - for counters, comparisons and simple loops

• Multiply ALU - single cycle multiply/accumulate for
interpolations and convolves

• Logical ALU - for bit manipulation

A summary of each type of ALU Unit is listed in the
following table:

ALU Unit    # of       # of Data  # of Status  # of Control  Size of
Name        Registers  Outputs    Outputs      Outputs       Microcode RAM
Read        1          1          -            1             800 bits
Write       1          -          -            1             704 bits
ReadWrite   2          1          -            1             1216 bits
Adder       4          3          1            -             1632 bits
Multiply    4          4          2            -             1920 bits
Logical     4          2          1            -             1376 bits



The outputs from the units are connected to the inputs
so that each unit can select input from both its own outputs
and all other units' outputs. The structure is as
illustrated in Fig. 143. As shown in Fig. 143, there are
multiple copies of each unit. The following table lists how
many of each type of unit are present, and provides an
overall total of specific resources.

Unit Name   # of   # of       # of Data  # of Status  # of Control  Size of
            Units  Registers  Outputs    Outputs      Outputs       Microcode RAM
Read        5      5          5          -            1             4000 bits
Write       3      3          -          -            1             2112 bits
ReadWrite   4      8          4          -            1             4864 bits
Adder       4      16         12         4            -             6528 bits
Multiply    4      16         16         4            -             7680 bits
Logical     2      8          4          2            -             2752 bits
TOTAL       24     38         41         10           1             27936 bits



All Units connected to FIFOs produce a control bit. The
control bits are ORed together to produce the SuspendALUs
control bit, which is passed as input into every unit. The
bit will be set if an attempt is due to be made this cycle
to access a FIFO which is not ready (e.g. it is being
written to and it is full). If set, all ALUs are suspended
for the cycle, and no processing takes place. Processing
will be suspended until the SuspendALUs control bit is clear
(e.g. if the FIFO is now ready). This mechanism is provided
so that synchronization is not an issue. While this does not
provide optimum performance, it does considerably reduce
hardware and software (microcode) design complexity.
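
The suspend logic can be modelled in a few lines. The function names and the way each unit's control bit is derived from FIFO state are assumptions of this sketch; the text states only that the per-unit control bits are ORed together.

```python
def fifo_control_bit(wants_write, fifo_full, wants_read, fifo_empty):
    """1 if this unit would access a FIFO that is not ready this cycle."""
    return int((wants_write and fifo_full) or (wants_read and fifo_empty))


def suspend_alus(control_bits):
    """OR all unit control bits together to form SuspendALUs."""
    result = 0
    for bit in control_bits:
        result |= bit
    return result
```

A single unit blocked on a full or empty FIFO is enough to stall every ALU for the cycle, which is what keeps the microcode free of explicit synchronization.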

The total number of data outputs is 41. This implies 6
bits are necessary in order to select 1 input from the
available outputs.
The total number of status outputs is 10. Since each
status output consists of 2 bits (a N (Negative) bit and a Z
(Zero) bit), there are actually 20 status bits. Consequently
5 bits are necessary in order to select 1 from the 20 status
bits.
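
The selector widths quoted above follow from the number of choices. A small check, using a hypothetical helper name:

```python
def selector_bits(choices):
    """Smallest number of bits able to address `choices` distinct items."""
    return (choices - 1).bit_length()


data_select_width = selector_bits(41)        # 41 data outputs
status_select_width = selector_bits(10 * 2)  # 10 status outputs x (N, Z)
```

These widths match the 6 bit input select fields and 5 bit status select fields in the microcode layouts that follow.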

Memory Interface Units 

In order to transfer data between the various ALUs and 
the memory, a variety of units have been introduced. They 
include Read, Write, and ReadWrite units. 

In order to reduce complexity of microcode, all units
hang if any one of them requires memory access and the FIFO
is not yet available (for reading or for writing). This
mechanism is provided by the SuspendALUs control bit,
described in the previous section.

The memory interface units do not access DRAM or the
caches themselves. They merely provide an interface between
other ALUs and the memory, providing a timing and
synchronization buffer via the FIFOs.
Read Unit 

The Read Unit provides data from DRAM. Specifically, it
is attached to a FIFO that is filled by a Read Iterator. The
Read Unit is attached to the output end of this FIFO, and
does not concern itself with how data is inserted into the
FIFO.

The Read Unit structure is set out in Fig. 143 and has 1
data output and no status outputs, although if a read is
requested from the FIFO, and the FIFO is empty, then the
entire ALU microcode is disabled until the FIFO has
something inside.

Microcode RAM

The Microcode RAM for the Read Unit is a 32 entry by 25
bit RAM (800 bits), containing the program for the ALU. The
meaning of each of the microcode control bits is described
here:

Bits    # Bits  Description
0       1       Read from FIFO
1       1       Sign Extend1 to 32 bits (input to BarrelShift1)
                  0 = no sign extend (pad with 0's)
                  1 = sign extend
2-3     2       BarrelShift1 (shifts left only, padding lower
                bits with 0)
                  00 = no shift
                  01 = shift left 8 bits
                  10 = shift left 16 bits
                  11 = shift left 24 bits
4-7     4       Write Enable to Latch (each enable-bit
                represents 1 byte)
8       1       Sign Extend2 (input to Bit Fiddler)
                  0 = no sign extend (pad with 0's)
                  1 = sign extend
9-11    3       Bit Fiddler (Generates 32 bit number from
                32 bit number ABCD)
                  000 = XXXA
                  001 = XXXB
                  010 = XXXC
                  011 = XXXD
                  100 = XXAB
                  101 = XXBC
                  110 = XXCD
                  111 = ABCD
12-13   2       BarrelShift2 (shifts left only, padding lower
                bits with 0)
                  00 = no shift
                  01 = shift left 8 bits
                  10 = shift left 16 bits
                  11 = shift left 24 bits
14-18   5       Select input status bit to compare against
                (branch if equal)
                  00000 - 11101 = select input status bit
                  11110 = don't jump (address is next
                          microcode word)
                  11111 = always jump (regardless of status)
19      1       Value to compare status bit against
                (branch if matches)
20-24   5       Address to jump to (if branching)
TOTAL   25
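
The field layout of the Read Unit microcode word can be expressed as a table of (name, start bit, width) entries and unpacked with shifts and masks. This sketch assumes bit 0 is the least significant bit of the word and that the lowest-numbered bit of each field is its least significant bit; the field names are invented for the example.

```python
READ_UNIT_FIELDS = [
    ("read_from_fifo", 0, 1),
    ("sign_extend_1", 1, 1),
    ("barrel_shift_1", 2, 2),
    ("latch_write_enable", 4, 4),
    ("sign_extend_2", 8, 1),
    ("bit_fiddler", 9, 3),
    ("barrel_shift_2", 12, 2),
    ("status_select", 14, 5),
    ("status_compare_value", 19, 1),
    ("jump_address", 20, 5),
]


def encode_word(values, fields=READ_UNIT_FIELDS):
    """Pack named field values into one microcode word (missing = 0)."""
    word = 0
    for name, start, width in fields:
        value = values.get(name, 0)
        if value >= (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << start
    return word


def decode_word(word, fields=READ_UNIT_FIELDS):
    """Unpack a microcode word into a dict of named fields."""
    return {
        name: (word >> start) & ((1 << width) - 1)
        for name, start, width in fields
    }
```

The same two helpers would work for the other units by swapping in a field list built from their tables.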



Write Unit 

The Write Unit is illustrated in Fig. 145 and provides
the interface of writing to DRAM to the ALU programs.
Specifically, it is attached to a FIFO that is read/emptied
by a Write Iterator. The Write Unit is attached to the input
end of this FIFO, and does not concern itself with how data
is removed from the FIFO.

The Write Unit does not output data to any other ALUs,
although if a write to the FIFO is requested, and the FIFO
is full, then the SuspendALUs signal is generated until the
FIFO can be written to.
Microcode RAM 

The Microcode RAM is a 32 entry by 22 bit RAM (704
bits), containing the program for the ALU. The meaning of
each of the microcode control bits is described here:

Bits    # Bits  Description
0-5     6       Select input from other units
6       1       Write Enable to Latch
7       1       Select IN1 or data from Latch
8-9     2       8 bit select from 32 bits (ABCD)
                  00 = D
                  01 = C
                  10 = B
                  11 = A
10      1       Write to FIFO
11-15   5       Select input status bit to compare against
                (branch if equal)
                  00000 - 11101 = select input status bit
                  11110 = don't jump (address is next
                          microcode word)
                  11111 = always jump (regardless of status)
16      1       Value to compare status bit against
                (branch if matches)
17-21   5       Address to jump to (if branching)
TOTAL   22



ReadWrite Unit 

The ReadWrite Unit is illustrated in Fig. 146 and
provides mechanisms for reading (and writing) into lookup
tables and creating DRAM FIFOs. The ReadWrite Unit has both
input and output, and attaches to 2 FIFOs that are in turn
connected to address generators that can interpret requests
for lookup data. Note that in a single cycle.

Clients send their requests to the ReadWrite Unit,
which in turn passes their requests into a FIFO. Results
from the request (in the case of a Read request) are then
read from the second FIFO. Note that these two FIFOs are not
the same as the 8 bit FIFOs attached to the Read and Write
Units. Instead there is a 17 bit output FIFO (1 bit for
request, 16 for data), and a 16 bit input FIFO.
Microcode RAM 

The Microcode RAM is a 32 entry by 38 bit RAM (1216 bits),
containing the program for the ALU. The meaning of each of
the microcode control bits is described here:

Bits    # Bits  Description
0-5     6       Select input from other units
6       1       Write Enable to Latch2
7       1       Select IN1 or data from Latch2
8-9     3       16 bit select from 32 bits (ABCD)
                  000 = 0D
                  001 = 0C
                  010 = 0B
                  011 = 0A
                  100 = CD
                  101 = BC
                  110 = AB
                  111 = 0
10      1       Write to FIFO
11      1       Request Bit to send as 17th bit in
                input to FIFO
12      1       Read from FIFO
13-14   2       Sign Extend1 to 32 bits (input to
                BarrelShift1)
                  00 = no sign extend (pad with 0's)
                  01 = sign extend using bit 7
                  10 = sign extend using bit 15
                  11 = reserved
15-16   2       BarrelShift1 (shifts left only, padding
                lower bits with 0)
                  00 = no shift
                  01 = shift left 8 bits
                  10 = shift left 16 bits
                  11 = shift left 24 bits
17-20   4       Write Enable to Latch (each enable-bit
                represents 1 byte)
21      1       Sign Extend2 (input to Bit Fiddler)
                  0 = no sign extend (pad with 0's)
                  1 = sign extend
22-24   3       Bit Fiddler (Generates 32 bit number
                from 32 bit number ABCD)
                  000 = XXXA
                  001 = XXXB
                  010 = XXXC
                  011 = XXXD
                  100 = XXAB
                  101 = XXBC
                  110 = XXCD
                  111 = ABCD
25-26   2       BarrelShift2 (shifts left only, padding
                lower bits with 0)
                  00 = no shift
                  01 = shift left 8 bits
                  10 = shift left 16 bits
                  11 = shift left 24 bits
27-31   5       Select input status bit to compare
                against (branch if equal)
                  00000 - 11101 = select input status bit
                  11110 = don't jump (address is next
                          microcode word)
                  11111 = always jump (regardless of status)
32      1       Value to compare status bit against
                (branch if matches)
33-37   5       Address to jump to (if branching)
TOTAL   38



Processing Units 
Adder ALU 

As illustrated in Fig. 147, each adder ALU is a simple
32 bit adder with Min and Max functionality, a barrel
shifter, and 4 registers. 3 sets of 32 bit values as well as
Negative and Zero status bits are provided as outputs from
the ALU. In addition, each ALU has a microcode RAM
containing small programs with limited branching ability.
The Adder ALU is designed to perform addition, simple
averaging (e.g. add 2 numbers and divide by 2), and provide
mechanisms for looping and control for other ALUs (via
status bits).
Microcode RAM 

The Microcode RAM is a 32 entry by 51 bit RAM (1632 bits),
containing the program for the ALU. The meaning of each of
the microcode control bits is described here:

Bits    # Bits  Description
0-5     6       Select IN1 from outputs from this and
                other ALUs
6-11    6       Select IN2 from outputs from this and
                other ALUs
12-17   6       Select IN3 from outputs from this and
                other ALUs
18-19   2       Select OUT1 from 4 registers
20-21   2       Select OUT2 from 4 registers
22-23   2       Select register to write to [from 4
                registers]
24      1       WriteEnable to register
                  0 = don't write
                  1 = write to specified register
25      1       Select AdderInput1 from [0, IN2]
26      1       Select RegisterInput from [0, IN1]
27      1       Negate AdderInput1
28-29   2       Select Function [MIN, MAX, +, ABS(+)]
30-31   2       Operation resolution [input to MIN,
                MAX, +, TST, ABS]
                  00 = 32 bits
                  01 = 16 bits
                  11 = 8 bits
32      1       Limit to 0 min [input to MIN, MAX, +]
33      1       Treat as signed [input to MAX, MIN,
                and Barrel Shifter]
34      1       Cin [input to +]
35      1       WrapEnable [input to +]
                If set, addition is allowed to wrap.
                If clear, addition will ceiling and
                floor at appropriate value for the
                resolution and its signed/unsigned
                nature.
36      1       Direction for barrel shift [sign
                extended if signed]
                  0 = left
                  1 = right
37-39   3       #Bits to shift [input to Barrel Shifter]
                  000 = 0
                  001 = 1
                  010 = 2
                  011 = 3
                  100 = 4
                  101 = 5
                  110 = 8
                  111 = 16
40-44   5       Select input status bit to compare
                against (branch if equal)
                  00000 - 11101 = select input status bit
                  11110 = don't jump (address is next
                          microcode word)
                  11111 = always jump (regardless of status)
45      1       Value to compare status bit against
                (branch if matches)
46-50   5       Address to jump to (if branching)
TOTAL   51
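
The effect of the WrapEnable bit on addition can be sketched as below. This is a software model under assumptions: the function name and parameter names are invented, and two's complement wrapping is assumed for the signed case.

```python
def adder_add(a, b, bits=32, signed=True, wrap_enable=False, carry_in=0):
    """Add with either wrap-around or ceiling/floor at the given resolution."""
    total = a + b + carry_in
    if signed:
        low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        low, high = 0, (1 << bits) - 1
    if wrap_enable:
        # Keep only the low `bits` bits, then reinterpret if signed.
        total &= (1 << bits) - 1
        if signed and total > high:
            total -= 1 << bits
        return total
    # WrapEnable clear: clamp to the representable range.
    return max(low, min(high, total))
```

For example, at 8 bit unsigned resolution 200 + 100 clamps to 255 with WrapEnable clear and wraps to 44 with it set.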



Multiply ALU 

As illustrated in Fig. 148, each multiply ALU is a 32 bit 
multiply/accumulator. It is designed for high speed 
interpolation and convolving, and includes a barrel shift on 
output for user-specified precision. The Multiply ALU 
therefore has 4 data outputs, and 2 status outputs. 
Microcode RAM 

The Microcode RAM is a 32 entry by 60 bit RAM (1920 bits),
containing the program for the ALU. The meaning of each of
the microcode control bits is described here:

Bits    # Bits  Description
0-5     6       Select IN1 from outputs from this and other
                ALUs
6-11    6       Select IN2 from outputs from this and other
                ALUs
12-17   6       Select IN3 from outputs from this and other
                ALUs
18-23   6       Select IN4 from outputs from this and other
                ALUs
24-25   2       Select OUT1 from 4 registers
26-27   2       Select OUT2 from 4 registers
28-29   2       Select register to write to [from 4
                registers]
30      1       WriteEnable to register
                  0 = don't write
                  1 = write to specified register
31      1       Select AdderInput1 from [0, IN2]
32      1       Negate AdderInput1
33      1       Select RegisterInput from [0, IN1]
34-35   2       Select 16 bits from IN3
                  00 = Low 8 bits (pads high 8 bits with 0)
                  01 = Low 16 bits
                  10 = Mid 16 bits
                  11 = High 16 bits
36-37   2       Select 16 bits from IN4 (see above for bit
                description)
38      1       BitsToNegate1 [calculates 1-X]
                  0 = Negate low 8 bits only
                  1 = Negate all 16 bits
39      1       Select between MultiplyInput1 and
                1-MultiplyInput1
40      1       Treat as Signed [input to +, *, and Barrel
                Shifter]
                  0 = Signed * and +
                  1 = Unsigned * and +
41      1       Operation resolution [input only as output
                always 32 bits]
                  0 = 16 bits
                  1 = 8 bits
42      1       Cin [input to +]
43      1       Limit to 0 min [input to +]
44      1       WrapEnable [input to +]
                If set, addition is allowed to wrap.
                If clear, addition will ceiling and floor at
                appropriate value for the resolution and its
                signed/unsigned nature.
45      1       Direction for barrel shift [sign extended if
                signed]
                  0 = left
                  1 = right
46-48   3       #Bits to shift [input to Barrel Shifter]
                  000 = 0
                  001 = 1
                  010 = 2
                  011 = 3
                  100 = 4
                  101 = 5
                  110 = 8
                  111 = 16
49-53   5       Select input status bit to compare against
                (branch if equal)
                  00000 - 11101 = select input status bit
                  11110 = don't jump (address is next
                          microcode word)
                  11111 = always jump (regardless of status)
54      1       Value to compare status bit against (branch
                if matches)
55-59   5       Address to jump to (if branching)
TOTAL   60



Logical ALU 

As illustrated in Fig. 149, the Logical ALU allows
simple logical operations such as AND, OR and XOR functions
to be performed. It is specifically useful for preparing
operands for interpolation, for merging separately created
components of a number, and for bit testing in order to
provide control to other units. Take for example, the case
of interpolation via lookup. Given an 8 bit number, the
lookup may only use 4 bits, and leave the remaining 4 bits
to provide the interpolation. The Logical ALU allows the
remaining 4 bits to be isolated. The Logical ALU therefore
has 2 data outputs and 1 status output.
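
The bit isolation in the interpolation-via-lookup example can be sketched with a shift and an AND. Which 4 bits feed the lookup is not stated; this sketch assumes the high 4 bits index the table and the low 4 bits give the interpolation fraction.

```python
def split_for_interpolated_lookup(value):
    """Split an 8 bit value into (table index, interpolation fraction)."""
    index = (value >> 4) & 0xF   # assumed: high nibble selects the entry
    fraction = value & 0xF       # assumed: low nibble is the fraction
    return index, fraction
```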
Microcode RAM

The Microcode RAM is a 32 entry by 43 bit RAM (1376 bits),
containing the program for the ALU. The meaning of each of
the microcode control bits is described here:

Bits    # Bits  Description
0-5     6       Select IN1 from outputs from this and other
                ALUs
6-11    6       Select IN2 from outputs from this and other
                ALUs
12-17   6       Select IN3 from outputs from this and other
                ALUs
18-19   2       Select OUT1 from 4 registers
20-21   2       Select register to write to [from 4 registers]
22      1       WriteEnable to register
                  0 = don't write
                  1 = write to specified register
23      1       Negate IN1
24      1       Negate IN2
25-26   2       Select Logical Function
                  00 = NOT(IN1)
                  01 = IN1 AND IN2
                  10 = IN1 OR IN2
                  11 = IN1 XOR IN2
27      1       Direction for barrel shift
                  0 = left
                  1 = right
28      1       SignExtend [input to Barrel Shifter]
                  0 = no sign extend
                  1 = sign extend when shifting right
29-31   3       #Bits to shift [input to Barrel Shifter]
                  000 = 0
                  001 = 1
                  010 = 2
                  011 = 3
                  100 = 4
                  101 = 5
                  110 = 8
                  111 = 16
32-36   5       Select input status bit to compare against
                (branch if equal)
                  00000 - 11101 = select input status bit
                  11110 = don't jump (address is next microcode
                          word)
                  11111 = always jump (regardless of status)
37      1       Value to compare status bit against (branch if
                matches)
38-42   5       Address to jump to (if branching)
TOTAL   43



I/O Units 177 

The I/O Units Block 177 is illustrated in further
detail in Fig. 19 and consists of a number of types of
address generators, each linked to a specific FIFO and the
Cache Interface. The address generators are able to read and
write data (specifically images in a variety of formats) as
well as tables and simulated FIFOs in DRAM. They are
customizable under software control, but cannot be
microcoded.

The types of address generators are: 

Read Image Iterators 190, used to iterate through
pixels of an image in a variety of ways

Write Image Iterators 191, used to write pixels of an
image in a variety of ways, and

Table I/O Units 192, used to randomly access pixels in
images, data in tables, and to simulate FIFOs.
There are a total of: 

5 Read Image Iterators 190, each connected to an 8 bit
output FIFO

3 Write Image Iterators 191, each connected to an 8 bit
input FIFO

4 Table I/O Units 192, each connected to a 16 bit
output FIFO and a 17 bit input FIFO

Each of the address generators is connected to one of
the 7 Cache Interface ports (the 8th is reserved for the
CPU). All FIFOs can be accessed by software as memory mapped
I/O.
Interpolation using ALUs

Interpolation is heavily used in image creation by the
ACP, from simple compositing through to tri-linear
interpolation for colour space conversion. Interpolation is
defined in one of two forms, with the value at fractional
position f between A and B given by: A + (B-A)f, or as
A(1-f) + fB.
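The equivalence of the two forms can be checked numerically. The following is a minimal sketch only; the function names are illustrative and exact fractions are used so that the comparison is not affected by rounding.

```python
from fractions import Fraction


def lerp_delta(a, b, f):
    """First form: A + (B - A) * f."""
    return a + (b - a) * f


def lerp_weighted(a, b, f):
    """Second form: A * (1 - f) + f * B."""
    return a * (1 - f) + f * b


# Both forms agree at every fractional position between A = 10 and B = 50.
a, b = Fraction(10), Fraction(50)
for step in range(5):
    f = Fraction(step, 4)
    assert lerp_delta(a, b, f) == lerp_weighted(a, b, f)
```

The first form needs one subtraction, one multiplication and one addition, whereas the second needs two multiplications, which is one reason a microcoded implementation might favour the first.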

Both forms reduce to the same implementation. Rather
than have specific interpolation hardware, it is possible to
microcode interpolation using the ALUs. Interpolation can
be implemented in a variety of ways using different numbers
of ALUs depending on the other functions required at the
same time. The following is a sample of interpolation
methods in the general sense only. The method of
interpolation and hence number of ALUs required etc. is
described as required for each use of interpolation within
the ACP.

It is therefore possible to set up a single Adder ALU and
Multiply ALU to work in conjunction so that they effectively
form a pipeline that produces the result of a single
interpolation every clock cycle (after a 2 cycle setup
delay). Sample microcode pseudocode for interpolation of a
1 dimensional data stream (given by Invalue) is:



Cycle   Multiply ALU                   Adder ALU
-----   ----------------------------   -------------------------
1       A = Invalue                    A = 0

2       Mult.Out1 = A                  Out1 = A,
        Calculate f * Adder.Out1 +     B = Invalue - Mult.Out1
        Mult.Out1
        B = Invalue

3       Mult.Out1 = B                  Goto 2
        Calculate f * Adder.Out1 +
        Mult.Out1
        A = Invalue
        Goto 2



It is also possible to perform interpolation on data
coming in as pairs of values in a single input stream. In
this case we only need 1 Multiply ALU; there is a pipeline
delay of 2 cycles, and the process takes 2 cycles on
average.



Cycle   Multiply ALU
-----   ----------------------------
1       Acc = (1-f) * Invalue

2       Mult.Out1 = A
        Acc = Acc + (f * Invalue)

3       Mult.Out1 = Acc
        Acc = (1-f) * Invalue
        Goto 2



Pairs of data in 2 streams 

If data is coming in as pairs of values from 2 input
streams, we can get by with 1 Multiply ALU and 1 Adder ALU.
In this case we can interpolate in 1 cycle on average.



Cycle   Multiply ALU                   Adder ALU
-----   ----------------------------   -------------------------
1       A = Invalue                    A = Invalue1 - Invalue2

2       Mult.Out1 = A                  Out1 = A,
        Calculate f * Adder.Out1 +     B = Invalue1 - Invalue2
        Mult.Out1
        B = Invalue

3       Mult.Out1 = B                  Out1 = B,
        Calculate f * Adder.Out1 +     A = Invalue1 - Invalue2
        Mult.Out1                      Goto 2
        A = Invalue
        Goto 2



Bi-linear interpolation

In bi-linear interpolation a total of 3 interpolations
need to be performed:

2 interpolations between the 2 pairs of data
1 interpolation between the output of the 2
interpolations
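The three interpolations listed above can be sketched as follows. This is an illustrative software model only, not microcode; the argument names (p00, p10, p01, p11, fx, fy) are assumptions introduced for the example.

```python
def lerp(a, b, f):
    """Single interpolation at fractional position f between a and b."""
    return a + (b - a) * f


def bilinear(p00, p10, p01, p11, fx, fy):
    """Bi-linear interpolation built from exactly 3 interpolations."""
    top = lerp(p00, p10, fx)      # interpolation 1: first pair of data
    bottom = lerp(p01, p11, fx)   # interpolation 2: second pair of data
    return lerp(top, bottom, fy)  # interpolation 3: between the two results


# Midpoint of a 2 x 2 neighbourhood with corner values 0, 10, 20 and 30.
centre = bilinear(0, 10, 20, 30, 0.5, 0.5)
```

The first two interpolations are independent of one another, which is what allows them to be scheduled on separate ALUs when speed matters more than ALU usage.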

If the data is coming in from a single stream, we can
choose between optimizing for speed or ALU usage. If we wish
to minimize ALU usage, we can perform 1 interpolation per 2
cycles using a single Multiply ALU. Thus the time required
for the 3 interpolations is 6 cycles. Alternatively we can
use 2 Multiply ALUs: perform the 2 interpolations in 4
cycles using 1 Multiply ALU, and perform the remaining
interpolation in 2 cycles with the other Multiply ALU. Since
the 2 Multiply ALUs work in parallel, the total time for
bi-linear interpolation would be 4 cycles.

If the data is coming in from 2 streams, we can again
optimize for speed or ALU usage. If we wish to minimize ALU
usage, we can perform 1 interpolation per 2 cycles using a
single Multiply ALU. Thus the time required for the 3
interpolations in a bi-linear interpolation is 6 cycles. We
can also use 1 Multiply ALU with an Adder ALU (see Pairs of
data in 2 streams, detailed above), giving 1 interpolation
per cycle (on average) and hence 3 cycles for the bi-linear
interpolation. Alternatively we can use 3 Multiply ALUs in
combination with 3 Adder ALUs to give an average throughput
of 1 cycle.

Tri-linear interpolation 

In tri-linear interpolation a total of 7 interpolations 

need to be performed: 

4 interpolations, 1 between each of the 4 pairs of data 
2 interpolations between the output of the 4 

interpolations 

1 interpolation between the output of the 2 
interpolations 

If the data is coming in a single stream, we can choose
between optimizing for speed or ALU usage. If we wish to
minimize ALU usage, we can perform 1 interpolation per 2
cycles using a single Multiply ALU. Thus the time required
for the 7 interpolations in a tri-linear interpolation is 14
cycles. Alternatively, we can use 2 Multiply ALUs: perform
the 4 interpolations in 8 cycles using 1 Multiply ALU, and
perform the remaining 3 interpolations in 6 cycles using the
other Multiply ALU. Since the 2 Multiply ALUs work in
parallel, the total time for tri-linear interpolation would
be 8 cycles.

If the data is coming in 2 streams, it is possible to
again optimize for speed or ALU usage. If it is necessary to
minimize ALU usage, it is possible to perform 1
interpolation per 2 cycles using a single Multiply ALU. Thus
the time required for the 7 interpolations in a tri-linear
interpolation is 14 cycles. It is possible to also use 1
Multiply ALU with an Adder, giving 1 interpolation per cycle
(on average) and hence 7 cycles for the tri-linear
interpolation. Alternatively we can use all 4 Multiply ALUs
in combination with all 4 Adder ALUs to give an average
throughput of 2 cycles.
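The cycle counts quoted for bi-linear and tri-linear interpolation follow from simple arithmetic, sketched below. The helper names are hypothetical, and the model assumes only what the text states: a unit completes one interpolation every 1 or 2 cycles, and units working in parallel finish when the most heavily loaded unit finishes.

```python
def serial_cycles(interpolations, cycles_per_interpolation):
    """Cycles when every interpolation passes through the same unit."""
    return interpolations * cycles_per_interpolation


def parallel_cycles(loads, cycles_per_interpolation=2):
    """Cycles when the work is split across units running in parallel.

    loads holds the number of interpolations assigned to each unit; the
    elapsed time is set by the most heavily loaded unit.
    """
    return max(load * cycles_per_interpolation for load in loads)


BILINEAR, TRILINEAR = 3, 7  # interpolations needed, as listed in the text

# Single Multiply ALU, 2 cycles per interpolation.
bilinear_minimal = serial_cycles(BILINEAR, 2)    # 6 cycles
trilinear_minimal = serial_cycles(TRILINEAR, 2)  # 14 cycles

# Two Multiply ALUs fed from a single stream.
bilinear_two_alus = parallel_cycles([2, 1])      # 4 cycles
trilinear_two_alus = parallel_cycles([4, 3])     # 8 cycles

# One Multiply ALU plus one Adder ALU fed from 2 streams, 1 cycle each.
bilinear_two_streams = serial_cycles(BILINEAR, 1)    # 3 cycles
trilinear_two_streams = serial_cycles(TRILINEAR, 1)  # 7 cycles
```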

Generation of Coordinates using VLIW Vector Processor 

Some functions that are linked to Write Iterators
require the X and Y coordinates of the current pixel being
processed in part of the processing pipeline. Particular
processing may need to take place at the end of each row, or
column being processed.

Each function requiring coordinates will have a
different pixel calculation time, and as such will have
slightly different timing for coordinate generation.
However, the essence and ALU requirements will be the same
in each instance.

Generate Sequential [X, Y] 

When a process is processing pixels in sequential order
according to the Sequential Read Iterator (or generating
pixels and writing them out to a Sequential Write Iterator),
the process as shown in Fig. 20 can be used to generate X, Y
coordinates. One form of implementation is as shown in Fig.
21. The coordinate generator counts up to ImageWidth in the
X ordinate, and once per ImageWidth pixels increments the Y
ordinate. The following constants of Fig. 21 are set by
software:



Constant   Value
--------   ----------------------
K1         ImageWidth
K2         ImageHeight (optional)



The following registers are used to hold temporary
variables:

Variable   Value
--------   -------------------------
Latch1     X (starts at 0 each line)
Latch2     Y (starts at 0)



The requirements are summarized as follows:

Requirements   *+    +     K     LU    Iterators
------------   ---   ---   ---   ---   ---------
General        0     3/4   3/4   0     0
TOTAL          0     3/4   3/4   0     0
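The behaviour of the sequential coordinate generator can be modelled in software as sketched below. This is a minimal sketch of the counting scheme described for Fig. 21 and not the microcode itself; the function name is illustrative, and ImageHeight is used here only to terminate the loop.

```python
def sequential_coordinates(image_width, image_height):
    """Yield (X, Y) in sequential order: X counts up to ImageWidth (K1),
    and Y is incremented once per ImageWidth pixels."""
    x = 0  # Latch1: X, starts at 0 on each line
    y = 0  # Latch2: Y, starts at 0
    while y < image_height:  # K2 (optional) bounds the loop in this sketch
        yield x, y
        x += 1
        if x == image_width:
            x = 0
            y += 1


coords = list(sequential_coordinates(3, 2))
```

For a 3 x 2 image the generator visits the whole of the first line before moving to the second.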



Generate Vertical Strip [X, Y] 

The vertical strip generation process is as shown in Fig. 22.
The coordinate generator simply counts up to ImageWidth in
the X ordinate, and once per ImageWidth pixels increments
the Y ordinate. An actual implementation is shown
in Fig. 23, where the following constants are set by
software:



Constant   Value
--------   -----------
K1         32
K2         ImageWidth
K3         ImageHeight



The following registers are used to hold temporary
variables:

Variable   Value
--------   ----------------------------------------------
Latch1     StartX (starts at 0, and is incremented by 32
           once per vertical strip)
Latch2     X
Latch3     EndX (starts at 32 and is incremented by 32
           to a maximum of ImageWidth once per vertical
           strip)
Latch4     Y



The requirements are summarized as follows: 



Requirements   *+    +     K     LU    Iterators
------------   ---   ---   ---   ---   ---------
General        0     4     7     0     0
TOTAL          0     4     7     0     0
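A software model of the vertical strip generator, using the latches tabulated above, might look as follows. It is a sketch under a stated assumption: within each 32 pixel wide strip the pixels are taken to be visited left to right and then top to bottom, which the text implies but does not spell out. The function name is illustrative.

```python
STRIP_WIDTH = 32  # K1


def vertical_strip_coordinates(image_width, image_height, strip_width=STRIP_WIDTH):
    """Yield (X, Y) strip by strip across the image."""
    start_x = 0                            # Latch1: StartX
    end_x = min(strip_width, image_width)  # Latch3: EndX, capped at ImageWidth
    while start_x < image_width:
        for y in range(image_height):        # Latch4: Y
            for x in range(start_x, end_x):  # Latch2: X
                yield x, y
        # Both latches advance by 32 once per vertical strip.
        start_x += strip_width
        end_x = min(end_x + strip_width, image_width)


strip_coords = list(vertical_strip_coordinates(40, 2))
```

With a 40 pixel wide image the second strip is only 8 pixels wide, which is why EndX is capped at ImageWidth.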



CPU Memory Decoder

The CPU Memory Decoder is a simple decoder for
satisfying CPU data accesses. The Decoder translates data
addresses into DRAM addresses (which then get passed on to
the Cache Interface) or into internal ACP register accesses
over the internal low speed bus. The CPU Memory Decoder
allows for memory mapped I/O of ACP registers. A
straightforward way of deciding is to use address bit 24.
If bit 24 is clear, the address is in the lower 16 MB range,
and hence can be directed to the Cache Interface to be
satisfied from DRAM. In most cases the DRAM will only be 8
MB, but we allocate 16 MB to cater for a higher memory model
Artcam. If bit 24 is set, the address represents an
internal ACP register address. The address is translated
into an access over the low speed bus to the requested
component in the ACP.
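The bit 24 test can be expressed in a few lines. In the sketch below the function name and the returned labels are hypothetical, and treating the low 24 bits as the register offset is an assumption made for illustration; the text specifies only that bit 24 selects between DRAM and ACP registers.

```python
REGISTER_SELECT_BIT = 24
DRAM_RANGE = 1 << REGISTER_SELECT_BIT  # 16 MB


def decode_cpu_address(address):
    """Route a CPU data address to DRAM or to the ACP register space."""
    if address & (1 << REGISTER_SELECT_BIT):
        # ASSUMPTION: the low 24 bits identify the register on the low speed bus.
        return "acp_register", address & (DRAM_RANGE - 1)
    # Bit 24 clear: lower 16 MB, satisfied from DRAM via the Cache Interface.
    return "dram", address
```

For example, address 0x007FFFFF (the top of an 8 MB DRAM) is routed to DRAM, while 0x01000010 is routed to the register space.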

Program Cache

A small cache is required for good performance. This
requirement is mostly due to the use of a Rambus DRAM, which
can provide high-speed data in bursts, but is inefficient
for single byte accesses. 16 dedicated cache lines of 32
bytes each will achieve most of the performance gain over no
cache, and limits the cache size to 512 bytes. The program
cache gives increased performance for the CPU, and even
allows small CPU functions to run completely from cache (and
therefore simultaneously with VLIW processes). The Program
Cache is a read only cache, taking its data from the DRAM
Memory Interface. The data used by CPU programs comes
through the CPU Memory Decoder and, if the address is in
DRAM, through the general Cache Interface.

Cache Interface 

The ACP contains a dedicated CPU instruction cache and
a general data cache interface. The CPU instruction cache is
described in the previous chapter, while this chapter
discusses the general data cache. In order to reduce
effective memory latency, the ACP contains 128 cache lines.
Each cache line is 32 bytes wide. Thus the total amount of
data cache is 4096 bytes (4k). Each cache line has a 4 bit
group number associated with it, thereby allowing the
splitting of the caches into 16 different groups. The
caching groups must be contiguous sets of cache lines.

All processor data requests use cache request group 0,
and although the CPU can assign any number of cache lines
(except none) to cache group 0, a minimum of 16 cache lines
is recommended for good performance.

The other users of the cache interface, namely the
Artcard Interface, the Display Controller, and the VLIW
Vector Processor, must use cache request groups
appropriately. The CPU is responsible for ensuring that a
correct number of cache lines is assigned to each cache
group for a given process. In any given cycle, 4
simultaneous accesses of 32 bits (4 bytes) to the caches are
permitted. Each access must be to a separate group of cache
lines.
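The grouping rules above can be captured in a small model. The following sketch is illustrative only: the function names are hypothetical, and it simply checks the constraints stated in the text (128 lines, at most 16 contiguous groups, a non-empty group 0, and at most 4 simultaneous accesses to distinct groups).

```python
NUM_CACHE_LINES = 128
LINE_BYTES = 32
NUM_GROUPS = 16  # a 4 bit group number per cache line


def assign_cache_groups(lines_per_group):
    """Return the group number of each cache line.

    lines_per_group[g] is the number of lines given to group g. Groups are
    laid out one after another, so each group is contiguous by construction.
    """
    if not lines_per_group or lines_per_group[0] < 1:
        raise ValueError("cache group 0 (CPU data) must own at least one line")
    if len(lines_per_group) > NUM_GROUPS:
        raise ValueError("at most 16 cache groups are available")
    if sum(lines_per_group) > NUM_CACHE_LINES:
        raise ValueError("only 128 cache lines are available")
    group_of_line = []
    for group, count in enumerate(lines_per_group):
        group_of_line.extend([group] * count)
    return group_of_line


def accesses_permitted(groups):
    """True if the requested groups can all be accessed in the same cycle."""
    return len(groups) <= 4 and len(set(groups)) == len(groups)


layout = assign_cache_groups([16, 48, 64])  # 16 lines for the CPU, as recommended
total_bytes = NUM_CACHE_LINES * LINE_BYTES  # 4096 bytes
```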

Serial Interfaces 
USB serial port interface 
This is a standard USB serial port, which is connected
to the internal chip low speed bus, thereby allowing the CPU
to control it.

Keyboard interface 

This is a standard low-speed serial port, which is
connected to the internal chip low speed bus, thereby
allowing the CPU to control it. It is designed to be
optionally connected to a keyboard to allow simple data
input to customize prints.

Authentication chip serial interfaces

These are 2 standard low-speed serial ports, which are
connected to the internal chip low speed bus, thereby
allowing the CPU to control them.

The reason for having 2 ports is to connect to both the
on-camera Authentication chip, and to the print-roll
Authentication chip using separate lines. Only using 1 line
may make it possible for a clone print-roll manufacturer to
design a chip which, instead of generating an authentication
code, tricks the camera into using the code generated by the
authentication chip in the camera.

Parallel Interface

The parallel interface connects the ACP to individual
static electrical signals. The following is a table of
connections to the parallel interface:

Connection                       Direction   Pins
------------------------------   ---------   ----
Paper transport stepper motor    Output      4
Artcard stepper motor            Output      4
Zoom stepper motor               Output      4
Guillotine solenoid              Output      1
Flash trigger                    Output      1
Status LCD segment drivers       Output      7
Status LCD common drivers        Output      4
Artcard illumination LED         Output      1
Artcard status LED (red/green)   Input       2
Artcard sensor                   Input       1
Paper pull sensor                Input       1
Orientation sensor               Input       2
Buttons                          Input       4
Total                                        36



The CPU is able to control each of these connections as 
memory mapped I/O via the low speed bus. 



Display Controller

Principles of Operation

When the "Take" button on an Artcam is half depressed,
the TFT will display the current image from the image sensor
(converted via a simple VLIW process). Once the Take button
is fully depressed, the Taken Image is displayed.

When the user presses the Print button and image
processing begins, the TFT is turned off. Once the image has
been printed the TFT is turned on again.

Structural Overview

The Display Controller is used in those Artcam models
that incorporate a flat panel display. An example display is
a TFT LCD of resolution 240 x 160 pixels. The Display
Controller has the following structure:

The Display Controller State Machine contains registers
that control the timing of the Sync Generation, where the
display image is to be taken from (in DRAM via the Cache
Interface), and whether the TFT should be active or not (via
TFT Enable) at the moment. The CPU can write to those
registers via the low speed bus.

Displaying a 240 x 160 pixel image on an RGB TFT
requires 3 components per pixel. The image taken from DRAM
is displayed via 3 DACs, one for each of the R, G, and B
output signals.

At an image refresh rate of 30 frames per second (60
fields per second) the Display Controller requires data
transfer rates of:

240 x 160 x 3 x 30 = 3.5MB per second

This data rate is low compared to the rest of the
system. However it is high enough to cause VLIW programs to
slow down during the intensive image processing. The general
principles of TFT operation should reflect this.
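The data rate quoted above can be reproduced with the short calculation below. It assumes one byte per colour component, which the text implies but does not state; the function name is illustrative.

```python
def display_bytes_per_second(width=240, height=160, components=3, frames_per_second=30):
    """Bytes fetched from DRAM each second to refresh the TFT.

    ASSUMPTION: each of the 3 components of a pixel occupies one byte.
    """
    return width * height * components * frames_per_second


rate = display_bytes_per_second()  # 3,456,000 bytes, i.e. about 3.5MB per second
```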
CPU Core (CPU)

The CPU core 72 can be any processor core with
sufficient processing power to perform the required core
calculations and control functions fast enough to meet
consumer expectations. Examples of suitable cores are:

MIPS R4000 core from LSI Logic 

StrongARM core 

The Artcam is deliberately designed so that the core 
processor 72 can be changed at any stage while maintaining 
complete compatibility. To use a different core, the Vark 
interpreter and camera control programs must be re-compiled 
for the new processor instruction set. This is a 
straightforward task if the Vark interpreter is written in a 
high level language (preferably C++) with no assembler. 

The Vark language preferably makes no assumptions about
the CPU, and is completely portable. Therefore any Artcards
will work with any CPU cores which meet the performance
specifications. As a result of this device independence,
future Artcam models can take advantage of new processor
cores as they are developed. Also, different ACP chip
designs may be fabricated by different manufacturers,
without the need to license or port the CPU core.

Program Cache 75

A small cache 75 is required for good performance. This
requirement is mostly due to the use of a Rambus DRAM, which
can provide high-speed data in bursts, but is inefficient
for single byte accesses. 16 dedicated cache lines of 32
bytes each will achieve most of the performance gain over no
cache, and limits the cache size to 512 bytes.

Data Cache 76

As with the program cache 75, a small cache 76 is
required for good performance. This requirement is again
mostly due to the use of a Rambus DRAM, which can provide
high-speed data in bursts, but is inefficient for single
byte accesses. 16 dedicated cache lines of 32 bytes each
will achieve most of the performance gain over no cache, and
limits the cache size to 512 bytes.

Image Sensor Interface (ISI) 83

The Image Sensor Interface (ISI) 83 takes data from the
CCD and makes it available for storage in DRAM. The CCD can
be a 3:2 aspect ratio image sensor, typically 750 x 500,
yielding 375K (8 bits per pixel). Fig. 24 illustrates the
configuration of a single pixel.

As illustrated in simplified form in Fig. 25, the ISI 83
includes a state machine that sends control information to
the CCD 2 (Fig. 2), including frame sync pulses and pixel
clock pulses in order to read the image. Pixels are read
from the CCD via a sub-ranging semi-flash ADC, and placed
into the VLIW Input FIFO. The VLIW is then able to process
and/or store the pixels.

The ISI 83 is used in conjunction with a VLIW microcode
program that stores the CCD image in DRAM. Processing occurs
in 2 steps:

1. A small VLIW program reads the pixels from the FIFO 192
and writes them to the DRAM via a Sequential Write
Iterator.

2. The CCD image in DRAM is rotated 90, 180 or 270 degrees
according to the orientation of the camera when the photo
was taken.

If the rotation is 0 degrees, then step 1 merely writes
the CCD image out to the final CCD image location and step 2
is not performed. If the rotation is non-0 degrees, the
image is written out to a temporary area (for example into
the print image memory area), and then rotated during step 2
into the final CCD image location. Step 1 is very simple
microcode, taking data from the VLIW Input FIFO 192 and
writing it to a Sequential Write Iterator. Step 2's rotation
is accomplished by using the accelerated Vark Affine
Transform function. The processing is performed in 2 steps
in order to reduce design complexity and to re-use the Vark
affine transform rotate logic already required for images.
This is acceptable since both steps are completed in less
than 0.03 seconds, a time imperceptible to the operator of
the Artcam. Even so, the read process is CCD speed bound,
taking 0.02 seconds to read the full frame. The time taken
to rotate the image can be 2 cycles per output pixel, which
is 750,000 cycles, or 0.008 seconds. The total time for both
stages is therefore 0.028 seconds.
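The timing figures above can be cross-checked as sketched below. The clock rate is not given in this passage; the value computed here is merely what the quoted figures imply (750,000 cycles in 0.008 seconds) and should be read as a derived estimate rather than a specification. The function name is illustrative.

```python
def isi_timing(width=750, height=500, cycles_per_pixel=2,
               read_seconds=0.02, rotate_seconds=0.008):
    """Reproduce the capture and rotate time budget quoted in the text."""
    pixels = width * height
    rotate_cycles = pixels * cycles_per_pixel
    return {
        "pixels": pixels,
        "rotate_cycles": rotate_cycles,
        "total_seconds": read_seconds + rotate_seconds,
        # Derived estimate only: the clock rate that would make 750,000
        # cycles take 0.008 seconds.
        "implied_clock_hz": rotate_cycles / rotate_seconds,
    }


timing = isi_timing()
```

The total of 0.028 seconds stays inside the 0.03 second budget, and the implied clock is a little under 100 MHz.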

The orientation will be important for converting
between the CCD image and the internal format image, since
the relative positioning of R, G, and B pixels changes with
orientation. The processed image may also have to be
rotated during the Print process in order to be in the
correct orientation for printing.

On the optional 3D model of the Artcam there are 2
CCDs, with their inputs multiplexed to a single ISI
(different microcode, but same ACP). If the CCD has a frame
store both frames can be taken simultaneously, and then
transferred to memory one at a time. If the CCD has a line
store, the frames can be transferred one line at a time in a
multiplexed fashion.

Display Controller 88

The display controller 88 is used in those Artcam
models that incorporate a flat panel display. An example
display is a TFT LCD of resolution 240 x 160 pixels. This
type of display would require a low data-rate.

When the "Take" button is half depressed, the TFT would
display the current image from the image sensor. Once taken,
the Taken Image would be displayed in its processed form.

Artcard Interface (AI) 87

The Artcard Interface (AI) 87 is responsible for taking
an Artcard image from the Artcard Reader 34, and decoding
it into the original data (usually a Vark script).
Specifically, the AI 87 accepts signals from the Artcard
scanner linear CCD 34, detects the bit pattern printed on
the card, and converts the bit pattern into the original
data, correcting read errors.

With no Artcard 9 inserted, the image printed from an
Artcam 30 is simply the sensed Photo Image cleaned up by any
standard image processing routines. The Artcard 9 is the
means by which users are able to modify a photo before
printing it out. By the simple task of inserting a specific
Artcard 9 into an Artcam 30, a user is able to define
complex image processing to be performed on the Photo Image.
With no Artcard 9 inserted the Photo Image is processed in
a standard way to create the Print Image.

When a single Artcard 9 is inserted into the Artcam,
that Artcard's effect is applied to the Photo Image to
generate the Print Image.

When the Artcard 9 is removed (ejected), the printed image
reverts to the Photo Image processed in a standard way.
When the user presses the button to eject an Artcard, an
event is placed in the event queue maintained by the
operating system running on the ACP 72. When the event is
processed (for example after the current Print has
occurred), the following things occur:

If the current Artcard is valid, then the Print Image
is marked as invalid and a 'Process Standard' event is
placed in the event queue. When the event is eventually
processed it will perform the standard image processing
operations on the Photo Image to produce the Print Image.

The motor is started to eject the Artcard and a time-
specific 'Stop-Motor' Event is added to the event queue.

Inserting an Artcard

When a user inserts an Artcard 9, the Artcard Sensor 49
detects it, notifying the ACP 72. This results in the
software inserting an 'Artcard Inserted' event into the
event queue. When the event is processed several things
occur:

The current Artcard is marked as invalid (as opposed to
'none').

The Print Image is marked as invalid.

The Artcard motor 37 is started up to load the Artcard.

The Artcard Interface 87 is instructed to read the
Artcard.

The Artcard Interface 87 accepts signals from the
Artcard scanner linear CCD 34, detects the bit pattern
printed on the card, and corrects errors in the detected bit
pattern, producing a valid Artcard data block in DRAM.
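The event handling described above can be modelled with a simple queue. The sketch below is hypothetical throughout: the class, attribute and function names are introduced for illustration and do not describe the actual Artcam software. It covers only the four actions listed for the 'Artcard Inserted' event.

```python
from collections import deque


class ArtcamState:
    """Hypothetical record of the state touched by Artcard events."""

    def __init__(self):
        self.artcard = "none"          # "none", "invalid" or "valid"
        self.print_image_valid = True
        self.artcard_motor_running = False
        self.read_requested = False


def handle_artcard_inserted(state):
    state.artcard = "invalid"           # invalid, as opposed to 'none'
    state.print_image_valid = False     # the old Print Image no longer applies
    state.artcard_motor_running = True  # start the motor to load the card
    state.read_requested = True         # instruct the Artcard Interface to read


HANDLERS = {"Artcard Inserted": handle_artcard_inserted}


def process_events(queue, state):
    """Process queued events in arrival order."""
    while queue:
        HANDLERS[queue.popleft()](state)


state = ArtcamState()
events = deque(["Artcard Inserted"])  # placed in the queue when the sensor fires
process_events(events, state)
```

The card remains marked invalid until the read and decode complete, which is why the handler does not set it to valid.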
Reading Data from the Artcard CCD - General Considerations

As illustrated in Fig. 26, the Artcard 9 must be
sampled at least at double the printed resolution to satisfy
Nyquist's Theorem. In practice it is better to sample at a
higher rate than this. Preferably, the pixels are sampled at 3
times the resolution of a printed dot in each dimension,
requiring 9 pixels to define a single dot, e.g. 230. Thus if
the resolution of the Artcard 9 is 1600 dpi, and the
resolution of the sensor 34 is 4800 dpi, then using a 50mm
CCD image sensor results in 9450 pixels per column.
Therefore if we require 2MB of dot data (at 9 pixels per
dot) then this requires 2MB*8*9/9450 = 15,978 columns =
approximately 16,000 columns. Of course if a dot is not
exactly aligned with the sampling CCD the worst and most
likely case is that a dot will be sensed over a 16 pixel
area (4x4) 231.
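The column counts above follow from the stated resolutions, as the sketch below shows. It assumes that 2MB means 2 x 1024 x 1024 bytes and that each data bit is printed as one dot; the function names are illustrative.

```python
MM_PER_INCH = 25.4


def pixels_per_column(ccd_length_mm=50, sensor_dpi=4800):
    """Pixels captured in one column by the linear CCD."""
    return ccd_length_mm / MM_PER_INCH * sensor_dpi


def columns_required(data_bytes=2 * 1024 * 1024, pixels_per_dot=9, column_pixels=9450):
    """Columns of pixel data needed to cover the dot data.

    ASSUMPTION: one printed dot per data bit, sampled as a 3 x 3 pixel block.
    """
    return data_bytes * 8 * pixels_per_dot / column_pixels


column_pixels = pixels_per_column()  # about 9449, rounded to 9450 in the text
columns = columns_required()         # about 15,978, rounded up to 16,000
```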

An Artcard 9 may be slightly warped due to heat damage,
slightly rotated (up to, say, 1 degree) due to differences in
insertion into an Artcard reader, and can have slight
differences in true data rate due to fluctuations in the
speed of the reader motor 37. These changes will cause
columns of data from the card not to be read as
corresponding columns of pixel data. As illustrated in Fig.
28, a 1 degree rotation in the Artcard 9 can cause the
pixels from a column on the card to be read as pixels across
166 columns.
Finally, the Artcard 9 should be read in a reasonable
amount of time with respect to the human operator. The data
on the Artcard covers most of the Artcard surface, so we can
limit our timing concerns to the Artcard data itself. A
reading time of 1.5 seconds is adequate for Artcard reading.

The Artcard should be loaded in 1.5 seconds. Therefore
all 16,000 columns of pixel data must be read from the CCD
34 in 1.5 seconds, i.e. 10,667 columns per second. Therefore
the time available to read one column is 1/10667 seconds, or
93,747ns. Pixel data can be written to the DRAM 1 column at
a time, completely independently from any processes that are
reading the pixel data.

The time to write one column of data (9450/2 bytes
since the reading can be 4 bits per pixel, giving 2 x 4 bit
pixels per byte) to DRAM is reduced by using 8 cache lines.
If 4 lines were written out at one time, the 4 banks can be
written to independently and thus overlap latency reduced.
Thus the 4725 bytes can be written in 11,840ns (4725/128 *
320ns). Thus the time taken to write a given column's data
to DRAM uses just under 13% of the available bandwidth.
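The bandwidth estimate can be reproduced as sketched below. The figure of 11,840ns is obtained when the number of 128 byte writes is rounded up to a whole number (37 writes of 320ns each); that rounding is an interpretation of the formula in the text, not something it states explicitly. The function names are illustrative.

```python
import math


def column_period_ns(load_seconds=1.5, columns=16000):
    """Time available to read one column of pixel data."""
    return load_seconds / columns * 1e9


def column_write_ns(column_bytes=4725, bytes_per_write=128, write_ns=320):
    """Time to write one column to DRAM, 4 cache lines (128 bytes) at a time.

    ASSUMPTION: a partial final write still costs a full 320ns.
    """
    return math.ceil(column_bytes / bytes_per_write) * write_ns


period = column_period_ns()            # about 93,750ns per column
write_time = column_write_ns()         # 37 writes x 320ns = 11,840ns
bandwidth_used = write_time / period   # roughly 0.126, i.e. just under 13%
```

The period comes out at 93,750ns rather than the 93,747ns quoted because the text rounds the column rate to 10,667 columns per second before inverting it.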
Decoding an Artcard

A simple look at the data sizes shows the impossibility
of fitting the process into the 8MB of memory 33 if the
entire Artcard pixel data (140 MB if each bit is read as a
3x3 array) as read by the linear CCD 34 is kept. For this
reason, the reading of the linear CCD, decoding of the
bitmap, and the un-bitmap process should take place in real-
time (while the Artcard 9 is travelling past the linear CCD
34), and these processes must effectively work without
having entire data stores available.

When an Artcard 9 is inserted, the old stored Print
Image and any expanded Photo Image becomes invalid. The new
Artcard 9 can contain directions for creating a new image
based on the currently captured Photo Image. The old Print
Image is invalid, and the area holding expanded Photo Image
data and image pyramid is invalid, leaving more than 5MB
that can be used as scratch memory during the read process.
Strictly speaking, the 1MB area where the Artcard raw data
is to be written can also be used as scratch data during the
Artcard read process as long as by the time the final Reed-
Solomon decode is to occur, that 1MB area is free again. The
reading process described here does not make use of the
extra 1MB area (except as a final destination for the data).

It should also be noted that the unscrambling process
requires two sets of 2MB areas of memory since unscrambling
cannot occur in place. Fortunately the 5MB scratch area
contains enough space for this process.

Turning now to Fig. 27, there is shown a flowchart 220 of
the steps necessary to decode the Artcard data. These steps
include reading in the Artcard 221, and decoding the read data
to produce corresponding encoded XORed scrambled bitmap data
223. Next a checkerboard XOR is applied to the data to
produce encoded scrambled data 224. This data is then
unscrambled 227 to produce data 225 before this data is
subjected to Reed-Solomon decoding to produce the original
raw data 226. Alternatively, the unscrambling and XOR process
can take place together, not requiring a separate pass of
the data. Each of the above steps is discussed in further
detail hereinafter. The Artcard Interface, therefore, has 4
phases, the first 2 of which are time-critical, and must
take place while pixel data is being read from the CCD:

Phase 1. Detect data area on Artcard

Phase 2. Detect bit pattern from Artcard based
on CCD pixels, and write as bytes.

Phase 3. Descramble and XOR the byte-pattern

Phase 4. Decode data (Reed-Solomon decode)

Fig. 29 illustrates a timeline 240 of the pixel reading
process and the four phases which are as follows:

Phase 1. As the Artcard 9 moves past the CCD 34 the AI
must detect the start of the data area by robustly detecting
special targets on the Artcard to the left of the data area.
If these cannot be detected, the card is marked as invalid.
The detection must occur in real-time, while the Artcard 9
is moving past the CCD 34.

Phase 2. Once the data area has been determined, the
main read process begins, placing pixel data from the CCD
into an 'Artcard data window', detecting bits from this
window, assembling the detected bits into bytes, and
constructing a byte-image in DRAM. This must all be done
while the Artcard is moving past the CCD.

Phase 3. Once all the pixels have been read from the
Artcard data area, the Artcard motor 37 can be stopped, and
the byte image descrambled and XORed. Although not requiring
real-time performance, the process should be fast enough not
to annoy the human operator. The process must take 2MB of
scrambled bit-image and write the unscrambled/XORed bit-
image to a separate 2MB image.

Phase 4. The final phase in the Artcard read process is
the Reed-Solomon decoding process, where the 2MB bit-image
is decoded into a 1MB valid Artcard data area. Again, while
not requiring real-time performance it is still necessary to
decode quickly with regard to the human operator. If the
decode succeeds, the card is marked as valid. If the
decode fails, decoding is attempted on any duplicates of the
data in the bit-image, a process that is repeated until
success or until there are no more duplicate images of the
data in the bit-image.

The 4 phase process described requires 4.5MB of DRAM.
2MB is reserved for Phase 2 output, and 0.5MB is reserved
for scratch data during phases 1 and 2. The remaining 2MB of
space can hold over 440 columns at 4725 bytes per column. In
practice, the pixel data being read is a few columns ahead
of the phase 1 algorithm, and in the worst case, about 180
columns behind phase 2, comfortably inside the 440 column
limit.
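The memory budget above can be checked with the following sketch, which assumes 1MB is 1024 x 1024 bytes. The names are illustrative.

```python
MB = 1024 * 1024


def columns_in_buffer(buffer_bytes=2 * MB, column_bytes=4725):
    """Whole columns of 4 bit pixel data that fit in the column buffer."""
    return buffer_bytes // column_bytes


phase2_output_mb = 2.0   # byte-image written by Phase 2
scratch_mb = 0.5         # scratch data for phases 1 and 2
column_buffer_mb = 2.0   # raw pixel columns awaiting processing
total_mb = phase2_output_mb + scratch_mb + column_buffer_mb  # 4.5MB

capacity = columns_in_buffer()  # 443 columns, i.e. "over 440"
worst_case_lag = 180            # columns, from the text
headroom = capacity - worst_case_lag
```

Even in the worst case the reader stays more than 260 columns inside the buffer limit.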

A description of the actual operation of each phase
will now be provided in greater detail.

Phase 1 - Detect data area on Artcard

This phase is concerned with robustly detecting the left-
hand side of the data area on the Artcard 9. Accurate
detection of the data area is achieved by accurate detection
of special targets printed on the left side of the card.
These targets are especially designed to be easy to detect
even if rotated up to 1 degree.

Turning to Fig. 30, there is shown an enlargement of the
left hand side of an Artcard 9. The side of the card is
divided into 16 bands, with a target, e.g. 241, located at the
center of each band. The bands are logical in that there is
no line 242 drawn to separate bands. Turning to Fig. 31,
there is shown a single target 241. The target 241 is a
printed black square containing a single white dot. The
idea is to detect firstly as many targets 241 as possible,
and then to join at least 8 of the detected white-dot
locations into a single logical straight line. If this can 
be done 243 ^is set, the data area is a fixed distance from 
this logical line. If it cannot be done, then the card is 
rejected as invalid. 

Returning to Fig. 30, the height of the card 9 is 3150
dots. A target (Target0) 241 is placed a fixed distance of
24 dots away from the top left corner 244 of the data area
so that it falls well within the first of 16 equal sized
regions 239 of 192 dots (576 pixels), with no target in the
final pixel region of the card 9. The target 241 must be
big enough to be easy to detect, yet be small enough not to
go outside the height of the region if the card is rotated 1
degree. A suitable size for the target is a 31 x 31 dot (93
x 93 pixels) black square 241 with the white dot 242.

At the worst rotation of 1 degree, we get a 1 column
shift every 57 pixels. Therefore in a 590 pixel sized band,
we cannot place any part of our symbol in the top or bottom
12 pixels or so of the band or they could be detected in the
wrong band at CCD read time if the card is worst case
rotated.

Therefore, if the black part of the rectangle is 57
pixels high (19 dots) we can be sure that at least 9.5 black
dots will be read in the same column by the CCD (worst
case is half the pixels are in one column and half in the
next). To be sure of reading at least 10 black dots in the
same column, we must have a height of 20 dots. To give room
for erroneous detection on the edge of the start of the
black dots, we increase the number of dots to 31, giving us
15 on either side of the white dot at the target's local
coordinate (15, 15). 31 dots is 93 pixels, which at most
suffers a 3 pixel shift in column, easily within the 576
pixel band.

Thus each target is a block of 31 x 31 dots (93 x 93
pixels) with the following composition:

   15 columns of 31 black dots each (45 pixel width columns
   of 93 pixels)

   1 column of 15 black dots (45 pixels) followed by 1
   white dot (3 pixels) and then a further 15 black dots (45
   pixels)

   15 columns of 31 black dots each (45 pixel width
   columns of 93 pixels)
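
The composition of a target can be illustrated with a short
sketch. The following Python is only an illustration under
stated assumptions: the names make_target and to_pixels, and
the convention 1 = black, 0 = white, are hypothetical and do
not appear in the description.

```python
# Sketch of one 31 x 31 dot target (1 = black dot, 0 = white dot).
# Names and the 1/0 convention are illustrative assumptions.
TARGET_DOTS = 31      # target is 31 x 31 dots
CENTER = 15           # white dot at local coordinate (15, 15)
PIXELS_PER_DOT = 3    # each dot is sensed as 3 x 3 pixels


def make_target():
    """Return the target as a list of 31 dot columns of 31 dots each."""
    columns = [[1] * TARGET_DOTS for _ in range(TARGET_DOTS)]
    columns[CENTER][CENTER] = 0  # single white dot in the middle column
    return columns


def to_pixels(columns, scale=PIXELS_PER_DOT):
    """Expand every dot into a scale x scale block of pixels."""
    pixel_columns = []
    for column in columns:
        stretched = [dot for dot in column for _ in range(scale)]
        pixel_columns.extend(list(stretched) for _ in range(scale))
    return pixel_columns


if __name__ == "__main__":
    pixels = to_pixels(make_target())
    print(len(pixels), "x", len(pixels[0]), "pixels")
```

Columns 0 to 14 and 16 to 30 are entirely black, and only
column 15 carries the white dot, which matches the three-part
composition listed above.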
Detect targets 

Targets are detected by reading columns of pixels, one
column at a time, rather than by detecting dots. It is
necessary to look within a given band for a number of
columns consisting of large numbers of contiguous black
pixels to build up the left side of a target. Next, it is
expected to see a white region in the center of further
black columns, and finally the black columns to the right of
the target center.

Eight cache lines are required for good cache
performance on the reading of the pixels. Each logical read
fills 4 cache lines via 4 sub-reads while the other 4 cache
lines are being used. This effectively uses up 13% of the
available RDRAM bandwidth.

As illustrated in Fig. 33, the detection mechanism
for detecting the targets uses a filter 245, a run-length
encoder 246, and a FIFO 247 that requires special wiring of
the top 3 elements (S1, S2, and S3) for random access.

The columns of input pixels are processed one at a time
until either all the targets are found, or until a specified
number of columns have been processed. To process a column,
the pixels are read from DRAM, passed through a filter 245
to detect a 0 or 1, and then run-length encoded 246. The bit
value and the number of contiguous bits of the same value
are placed in the FIFO 247. Each entry of the FIFO 247 is
8 bits: 7 bits 250 to hold the run-length, and 1 bit 249 to
hold the value of the bit detected.

The run-length encoder 246 only encodes contiguous
pixels within a 576 pixel (192 dot) region.

The top 3 elements in the FIFO 247 can be accessed 252
in any random order. The run lengths (in pixels) of these
entries are filtered into 3 values: short, medium, and long
in accordance with the following table:

   Short    Used to detect white dot.         RunLength < 16

   Medium   Used to detect runs of black      16 <= RunLength < 48
            above or below the white dot
            in the center of the target.

   Long     Used to detect run lengths of     RunLength >= 48
            black to the left and right of
            the center dot in the target.
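
The table above amounts to a two-threshold classification.
A minimal sketch in Python follows; the function name
classify_run is a hypothetical label, while the thresholds 16
and 48 are taken from the table.

```python
def classify_run(run_length):
    """Map a run length in pixels to 'short', 'medium' or 'long'."""
    if run_length < 16:
        return "short"    # candidate white dot (about 3 pixels)
    if run_length < 48:
        return "medium"   # black above or below the white dot (about 45 pixels)
    return "long"         # full black column of the target (about 93 pixels)


if __name__ == "__main__":
    for length in (3, 45, 93):
        print(length, classify_run(length))
```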



Looking at the top three entries in the FIFO 247 there are 3
specific cases of interest:

   Case 1   S1 = white long
            S2 = black long
            S3 = white medium/long

            We have detected a black column of the target to
            the left of or to the right of the white center
            dot.

   Case 2   S1 = white long
            S2 = black medium
            S3 = white short
            Previous 8 columns were Case 1

            If we've been processing a series of columns of
            Case 1s, then we have probably detected the white
            dot in this column. We know that the next entry
            will be black (or it would have been included in
            the white S3 entry), but the number of black
            pixels is in question. Need to verify by checking
            after the next FIFO advance (see Case 3).

   Case 3   Prev = Case 2
            S3 = black med

            We have detected part of the white dot. We expect
            around 3 of these, and then some more columns of
            Case 1.
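
The three cases can be sketched as a small matching
function. This is only an illustration: the tuple
representation (colour, run length), the function names, and
the order in which the cases are tested are assumptions made
for the example and are not specified in the description.

```python
def classify_run(run_length):
    """Short / medium / long thresholds from the run-length table."""
    if run_length < 16:
        return "short"
    if run_length < 48:
        return "medium"
    return "long"


def detect_case(s1, s2, s3, prev_was_case2=False, case1_streak=0):
    """Return 1, 2 or 3 for the cases of interest, else None.

    s1, s2, s3 are (colour, run_length) tuples for the top three
    FIFO entries; case1_streak counts preceding Case 1 columns.
    """
    e1 = (s1[0], classify_run(s1[1]))
    e2 = (s2[0], classify_run(s2[1]))
    e3 = (s3[0], classify_run(s3[1]))

    # Case 3 is only meaningful directly after a Case 2 entry.
    if prev_was_case2 and e3 == ("black", "medium"):
        return 3
    if e1 == ("white", "long") and e2 == ("black", "long") \
            and e3 in (("white", "medium"), ("white", "long")):
        return 1
    if e1 == ("white", "long") and e2 == ("black", "medium") \
            and e3 == ("white", "short") and case1_streak >= 8:
        return 2
    return None
```

For example, a column through the left part of a target
yields Case 1, while a column through the white dot yields
Case 2 only if at least 8 Case 1 columns preceded it.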



Preferably, the following information per region band is
kept:

   TargetDetected          1 bit
   BlackDetectCount        4 bits
   WhiteDetectCount        3 bits
   PrevColumnStartPixel    15 bits
   TargetColumn ordinate   16 bits (15:1)
   TargetRow ordinate      16 bits (15:1)
   TOTAL                   7 bytes (rounded to 8 bytes
                           for easy addressing)

Given a total of 7 bytes, it makes address generation
easier if the total is assumed to be 8 bytes. Thus 16
entries require 16 * 8 = 128 bytes, which fits in 4 cache
lines. The address range would be inside the scratch 0.5MB
DRAM area since other phases make use of the remaining 4MB
data area.

When beginning to process a given pixel column, the
register value S2StartPixel 254 is reset to 0. As entries in
the FIFO advance from S2 to S1, their run lengths are added
255 to the existing S2StartPixel value, giving the exact
pixel position of the run currently defined in S2. Looking
at each of the 3 cases of interest in the FIFO, S2StartPixel
can be used to determine the start of the black area of a
target (Cases 1 and 2), and also the start of the white dot
in the center of the target (Case 3). An algorithm for
processing columns can be as follows:



   1   TargetDetected[0-15] := 0
       BlackDetectCount[0-15] := 0
       WhiteDetectCount[0-15] := 0
       TargetRow[0-15] := 0
       TargetColumn[0-15] := 0
       PrevColStartPixel[0-15] := 0
       CurrentColumn := 0

   2   Do ProcessColumn

   3   CurrentColumn++

   4   If (CurrentColumn <= LastValidColumn)
           Goto 2



The steps involved in processing a column
(ProcessColumn) are as follows:

   1   S2StartPixel := 0
       FIFO := 0
       BlackDetectCount := 0
       WhiteDetectCount := 0
       ThisColumnDetected := FALSE
       PrevCaseWasCase2 := FALSE

   2   If (!TargetDetected[Target]) & (!ColumnDetected[Target])
           ProcessCases
       EndIf

   3   PrevCaseWasCase2 := Case=2

   4   Advance FIFO



The processing for each of the 3 cases (ProcessCases)
is as follows:

Case 1:

   Condition:  BlackDetectCount[Target] < 8
               OR
               WhiteDetectCount[Target] = 0

   Action:     Δ := ABS(S2StartPixel - PrevColStartPixel[Target])
               If (0 <= Δ < 2)
                   BlackDetectCount[Target]++ (max value = 8)
               Else
                   BlackDetectCount[Target] := 1
                   WhiteDetectCount[Target] := 0
               EndIf
               PrevColStartPixel[Target] := S2StartPixel
               ColumnDetected[Target] := TRUE
               BitDetected = 1

   Condition:  BlackDetectCount[Target] >= 8
               AND
               WhiteDetectCount[Target] != 0

   Action:     PrevColStartPixel[Target] := S2StartPixel
               ColumnDetected[Target] := TRUE
               BitDetected = 1
               TargetDetected[Target] := TRUE
               TargetColumn[Target] := CurrentColumn - 8 -
                                       (WhiteDetectCount[Target]/2)



Case 2:

No special processing is recorded except for setting
the 'PrevCaseWasCase2' flag for identifying Case 3 (see Step
3 of processing a column described above).

Case 3:

   Condition:  PrevCaseWasCase2 = TRUE
               BlackDetectCount[Target] >= 8
               WhiteDetectCount = 1

   Action:     If (WhiteDetectCount[Target] < 2)
                   TargetRow[Target] = S2StartPixel + (S2RunLength/2)
               EndIf
               Δ := ABS(S2StartPixel - PrevColStartPixel[Target])
               If (0 <= Δ < 2)
                   WhiteDetectCount[Target]++
               Else
                   WhiteDetectCount[Target] := 1
               EndIf
               PrevColStartPixel[Target] := S2StartPixel
               ThisColumnDetected := TRUE
               BitDetected = 0



At the end of processing a given column, a comparison
is made of the current column to the maximum number of
columns for target detection. If the number of columns
allowed has been exceeded, then it is necessary to check how
many targets have been found. If fewer than 8 have been
found, the card is considered invalid.

Process targets

After the targets have been detected, they should be
processed. All the targets may be available or merely some
of them. Some targets may also have been erroneously
detected.

This phase of processing is to determine a mathematical
line that passes through the center of as many targets as
possible. The more targets that the line passes through, the
greater the confidence that the target position has been
found. The limit is set to be 8 targets. If a line passes
through at least 8 targets, then it is taken to be the right
one.

It is alright to take a brute-force but straightforward
approach since there is the time to do so (see below), and
lowering complexity makes testing easier. It is necessary to
determine the line between targets 0 and 1 (if both targets
are considered valid) and then determine how many targets
fall on this line. Then we determine the line between
targets 0 and 2, and repeat the process. Eventually we do
the same for the line between targets 1 and 2, 1 and 3 etc.
and finally for the line between targets 14 and 15. Assuming
all the targets have been found, we need to perform
15+14+13+... = 90 sets of calculations (with each set of
calculations requiring 16 tests = 1440 actual calculations),
and choose the line which has the maximum number of targets
found along the line. The algorithm for target location can
be as follows:

TargetA := 0
MaxFound := 0
BestLine := 0
While (TargetA < 15)
    If (TargetA is Valid)
        TargetB := TargetA + 1
        While (TargetB <= 15)
            If (TargetB is Valid)
                CurrentLine := line between TargetA and TargetB
                TargetsHit := 0
                TargetC := 0
                While (TargetC <= 15)
                    If (TargetC is Valid AND TargetC on line AB)
                        TargetsHit++
                    EndIf
                    TargetC++
                EndWhile
                If (TargetsHit > MaxFound)
                    MaxFound := TargetsHit
                    BestLine := CurrentLine
                EndIf
            EndIf
            TargetB++
        EndWhile
    EndIf
    TargetA++
EndWhile

If (MaxFound < 8)
    Card is Invalid
Else
    Store expected centroids for rows based on BestLine
EndIf

As illustrated in Fig. 33, in the algorithm above, to
determine a CurrentLine 260 from Target A 261 and Target B
262, it is necessary to calculate Δrow 264 and Δcolumn 265
between targets 261, 262, and the location of Target A. It
is then possible to move from Target 0 to Target 1 etc. by
adding Δrow and Δcolumn. The found (if actually found)
location of Target N can be compared to the calculated
expected position of Target N on the line, and if it falls
within the tolerance, then Target N is determined to be on
the line.

To calculate Δrow and Δcolumn:

   Δrow = (row_TargetB - row_TargetA) / (B - A)
   Δcolumn = (column_TargetB - column_TargetA) / (B - A)

Then we calculate the position of Target0:

   row = row_TargetA - (A * Δrow)
   column = column_TargetA - (A * Δcolumn)

And compare (row, column) against the actual row_Target0
and column_Target0. To move from one expected target to the
next (e.g. from Target0 to Target1), we simply add Δrow and
Δcolumn to row and column respectively. To check if each
target is on the line, we must calculate the expected
position of Target0, and then perform one add and one
comparison for each target ordinate.
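
The line search and the per-target comparison can be
sketched together in Python. This is an illustrative sketch
only: the function name best_line, the representation of an
undetected target as None, and the tolerance of 2 pixels are
assumptions, since the description does not state a
tolerance value.

```python
def best_line(targets, tolerance=2.0):
    """Brute-force search for the line through the most targets.

    targets: list of (row, column) tuples, or None where a target
    was not detected. Returns (hits, (row0, col0, d_row, d_col)),
    where (row0, col0) is the expected position of target 0.
    """
    best_hits, best = 0, None
    count = len(targets)
    for a in range(count - 1):
        if targets[a] is None:
            continue
        for b in range(a + 1, count):
            if targets[b] is None:
                continue
            # Per-target step along the line through targets a and b.
            d_row = (targets[b][0] - targets[a][0]) / (b - a)
            d_col = (targets[b][1] - targets[a][1]) / (b - a)
            # Expected position of target 0 on that line.
            row0 = targets[a][0] - a * d_row
            col0 = targets[a][1] - a * d_col
            row, col, hits = row0, col0, 0
            for c in range(count):
                found = targets[c]
                if found is not None \
                        and abs(found[0] - row) <= tolerance \
                        and abs(found[1] - col) <= tolerance:
                    hits += 1
                row += d_row   # one add per ordinate to reach the next target
                col += d_col
            if hits > best_hits:
                best_hits, best = hits, (row0, col0, d_row, d_col)
    return best_hits, best


if __name__ == "__main__":
    found = [(100 + 576 * i, 50 + 10 * i) for i in range(16)]
    found[3] = None            # target not detected
    found[5] = (9999, 9999)    # erroneous detection
    hits, line = best_line(found)
    print(hits, line, "valid" if hits >= 8 else "invalid")
```

A card would be accepted when the returned hit count is at
least 8, as in the pseudocode above.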

At the end of comparing all 16 targets against a
maximum of 90 lines, the result is the best line through the
valid targets. If that line passes through at least 8
targets (i.e. MaxFound >= 8), it can be said that enough
targets have been found to form a line, and thus the card
can be processed. If the best line passes through fewer than
8, then the card is considered invalid.

The resulting algorithm takes 180 divides to calculate
Δrow and Δcolumn, 180 multiply/adds to calculate the Target0
position, and then 2880 adds/comparisons. The time we have
to perform this processing is the time taken to read 36
columns of pixel data = 3,374,892ns. Not even accounting for
the fact that an add takes less time than a divide, it is
necessary to perform 3240 mathematical operations in
3,374,892ns. That gives approximately 1040ns per operation,
or 104 cycles. The CPU can therefore safely perform the
entire processing of targets, reducing complexity of design.
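
The timing budget can be checked with a few lines of
Python. The sketch below is an assumption-laden illustration:
the 10ns cycle time is inferred from the stated ratio of
1040ns to 104 cycles, and the function names are
hypothetical. Note that a direct count of the distinct pairs
among 16 targets gives 120 lines rather than the 90 used in
the text; the sketch evaluates both figures, and the budget
remains comfortable in either case.

```python
import math

COLUMN_READ_NS = 93_747   # time to read one pixel column, from the text
CYCLE_NS = 10             # assumption: inferred from 1040ns ~ 104 cycles


def operation_count(lines, targets=16):
    """Divides, multiply/adds and add/compares for a number of lines."""
    divides = 2 * lines                 # delta row and delta column per line
    multiply_adds = 2 * lines           # expected position of target 0
    add_compares = 2 * lines * targets  # one add/compare per ordinate
    return divides + multiply_adds + add_compares


def budget(lines, columns=36):
    """Return (operations, ns per operation, cycles per operation)."""
    available_ns = columns * COLUMN_READ_NS
    ops = operation_count(lines)
    ns_per_op = available_ns / ops
    return ops, ns_per_op, ns_per_op / CYCLE_NS


if __name__ == "__main__":
    print(budget(90))                    # figures used in the text
    print(budget(math.comb(16, 2)))      # all distinct pairs of 16 targets
```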
Update centroids based on data edge border and clockmarks

Step 0: Locate the data area

From Target 0 (241 of Fig. 30) it is a predetermined
fixed distance in rows and columns to the top left border
244 of the data area, and then a further 1 dot column to the
vertical clock marks 273. So we use TargetA, Δrow and
Δcolumn found in the previous stage (Δrow and Δcolumn
refer to distances between targets) to calculate the
centroid or expected location for Target0 as described
previously.

Since the fixed pixel offset from Target0 to the data
area is related to the distance between targets (192 dots
between targets, and 24 dots between Target0 and the data
area 243), simply add Δrow/8 to Target0's centroid column
coordinate (aspect ratio of dots is 1:1). Thus the top
co-ordinate can be defined as:

   column_dotColumnTop = column_Target0 + (Δrow / 8)
   row_dotColumnTop = row_Target0 + (Δcolumn / 8)

Next Δrow and Δcolumn are updated to give the number
of pixels between dots in a single column (instead of
between targets) by dividing them by the number of dots
between targets:

   Δrow = Δrow / 192
   Δcolumn = Δcolumn / 192
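
As a worked illustration of Step 0, the following Python
sketch applies the two formulas for the top co-ordinate and
then rescales the deltas from per-target to per-dot. The
function name locate_data_area and the sample values are
hypothetical; the divisors 8 (192 / 24) and 192 are taken
from the text.

```python
def locate_data_area(row_target0, col_target0, d_row, d_col):
    """Return (row_top, col_top, d_row_per_dot, d_col_per_dot).

    d_row and d_col are the distances in pixels between adjacent
    targets, which are 192 dots apart; the data area starts 24 dots
    (one eighth of that distance) from Target0.
    """
    col_top = col_target0 + d_row / 8
    row_top = row_target0 + d_col / 8
    return row_top, col_top, d_row / 192, d_col / 192


if __name__ == "__main__":
    # Unrotated card: targets are 576 pixels apart with no column drift.
    print(locate_data_area(100, 50, 576, 0))
```

For an unrotated card the data area therefore starts 72
pixels (24 dots) across from Target0, and the per-dot row
delta becomes 3 pixels.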

We also set the CurrentColumn register (see Phase 2) to
be -1 so that after Step 2, when Phase 2 begins, the
CurrentColumn register will increment from -1 to 0.
Step 1: Write out the initial centroid deltas (Δ) and bit
history

This simply involves writing setup information required
for Phase 2.

This can be achieved by writing 0s to all the Δrow and
Δcolumn entries for each row, and a bit history. The bit
history is actually an expected bit history since it is
known that to the left of the clock mark column 275 is a
border column 277, and before that, a white area. The bit
history therefore is 011, 010, 011, 010 etc.
Step 2: Update the centroids based on actual pixels read.

The bit history is set up in Step 1 according to the
expected clock marks and data border. The actual centroids
for each dot row can now be more accurately set (they were
initially 0) by comparing the expected data against the
actual pixel values. The centroid updating mechanism is
achieved by simply performing Step 3 of Phase 2.
Phase 2 - Detect bit pattern from Artcard based on pixels
read, and write as bytes

Since a dot from the Artcard 9 requires a minimum of 9
sensed pixels over 3 columns to be represented, there is
little point in performing dot detection calculations every
sensed pixel column. It is better to average the time
required for processing over the average dot occurrence, and
thus make the most of the available processing time. This
allows processing of a column of dots from an Artcard 9 in
the time it takes to read 3 columns of data from the
Artcard. Although the most likely case is that it takes 4
columns to represent a dot, the 4th column will be the last
column of one dot and the first column of a next dot.
Processing should therefore be limited to only 3 columns.

As the pixels from the CCD are written to the DRAM in
13% of the time available, 83% of the time is available for
processing of 1 column of dots, i.e. 83% of (93,747*3) = 83%
of 281,241ns = 233,430ns.

In the available time, it is necessary to detect 3150
dots, and write their bit values into the raw data area of
memory. The processing therefore requires the following
steps:

For each column of dots on the Artcard:

   Step 0: Advance to the next dot column

   Step 1: Detect the top and bottom of an Artcard dot
   column (check clock marks)

   Step 2: Process the dot column, detecting bits and
   storing them appropriately

   Step 3: Update the centroids

Since we are processing the Artcard's logical dot
columns, and these may shift over 165 pixels, the worst case
is that we cannot process the first column until at least
165 columns have been read into DRAM. Phase 2 would
therefore finish the same amount of time after the read
process had terminated. The worst case time is: 165 *
93,747ns = 15,468,255ns or 0.015 seconds.

Step 0: Advance to the next dot column

In order to advance to the next column of dots we add
Δrow and Δcolumn to the dotColumnTop to give us the
centroid of the dot at the top of the column. The first time
we do this, we are currently at the clock marks column 276
to the left of the bit image, and so we advance to the first
column of data. Since Δrow and Δcolumn refer to distance
between dots within a column, to move between dot columns it
is necessary to add Δrow to column_dotColumnTop and Δcolumn
to row_dotColumnTop.

To keep track of what column number is being processed,
the column number is recorded in a register called
CurrentColumn. Every time the sensor advances to the next
dot column it is necessary to increment the CurrentColumn
register. The first time it is incremented, it is
incremented from -1 to 0 (see Step 0 Phase 1). The
CurrentColumn register determines when to terminate the read
process (when reaching maxColumns), and also is used to
advance the DataOut Pointer to the next column of byte
information once all 8 bits have been written to the byte
(once every 8 dot columns). The lower 3 bits determine what
bit we're up to within the current byte. It will be the same
bit being written for the whole column.

Step 1: Detect the top and bottom of an Artcard dot
column

In order to process a dot column from an Artcard, it is
necessary to detect the top and bottom of a column. The
column should form a straight line between the top and
bottom of the column (except for local warping etc.).
Initially dotColumnTop points to the clock mark column 276.
We simply toggle the expected value, write it out into the
bit history, and move on to Step 2, whose first task will be
to add the Δrow and Δcolumn values to dotColumnTop to
arrive at the first data dot of the column.

Step 2: Process an Artcard's dot column

Given the centroids of the top and bottom of a column
in pixel coordinates, the column should form a straight line
between them, with possible minor variances due to warping
etc.

Assuming the processing is to start at the top of a
column (at the top centroid coordinate) and move down to the
bottom of the column, subsequent expected dot centroids are
given as:

   row_next = row + Δrow
   column_next = column + Δcolumn

This gives us the address of the expected centroid for
the next dot of the column. However to account for local
warping and error we add another Δrow and Δcolumn based on
the last time we found the dot in a given row. In this way
we can account for small drifts that accumulate into a
maximum drift of some percentage from the straight line
joining the top of the column to the bottom.

We therefore keep 2 values for each row, but store them
in separate tables since the row history is used in Step 3
of this phase:

   Δrow and Δcolumn (2 @ 4 bits each = 1 byte)

   row history (3 bits per row, 2 rows are stored per
   byte)

For each row we need to read a Δrow and Δcolumn to
determine the change to the centroid. The read process takes
5% of the bandwidth and 2 cache lines:

   76 * (3150/32) + 2 * 3150 = 13,824ns = 5% of bandwidth

Once the centroid has been determined, the pixels
around the centroid need to be examined to detect the status
of the dot and hence the value of the bit. In the worst case
a dot covers a 4x4 pixel area. However, thanks to the fact
that we are sampling at 3 times the resolution of the dot,
the number of pixels required to detect the status of the
dot and hence the bit value is much less than this. We only
require access to 3 columns of pixels at any one time.

In the worst case of pixel drift due to a 1 degree
rotation, centroids will shift 1 column every 57 pixel rows,
but since a dot is 3 pixels in diameter, a given column will
be valid for 171 pixel rows (3*57). As a byte contains 2
pixels, the number of bytes valid in each buffered read (4
cache lines) will be a worst case of 86 (out of 128 read).

Once the bit has been detected it must be written out
to DRAM. We store the bits from 8 columns as a set of
contiguous bytes to minimise DRAM delay. Since all the bits
from a given dot column will correspond to the next bit
position in a data byte, we can read the old value for the
byte, shift and OR in the new bit, and write the byte back.

The read / shift&OR / write process requires 2 cache
lines.
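
The read / shift&OR / write step can be sketched as
follows. This is a minimal illustration under assumptions:
the function name store_bit, the use of a Python bytearray
for the bit-image, and the shift direction (left shift, new
bit entering at the least significant position) are not
stated in the description.

```python
def store_bit(bit_image, byte_index, bit):
    """Shift the stored byte left and OR the new bit into bit 0."""
    old = bit_image[byte_index]                              # read
    bit_image[byte_index] = ((old << 1) | (bit & 1)) & 0xFF  # shift & OR, write


if __name__ == "__main__":
    image = bytearray(1)
    # One bit per dot column; after 8 dot columns the byte is complete.
    for column_bit in (1, 0, 1, 1, 0, 0, 1, 0):
        store_bit(image, 0, column_bit)
    print(format(image[0], "08b"))
```

Under this assumed bit order, the bit from the first of the
8 dot columns ends up in the most significant position of the
byte.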

We need to read and write the bit history for the given 
row as we update it. We only require 3 bits of history per 
row, allowing the storage of 2 rows of history in a single 
byte. The read / shift&OR / write process requires 2 cache 
lines. 

The total bandwidth required for the bit detection and
storage is summarized in the following table:

   Read centroid Δ                               5%
   Read 3 columns of pixel data                 19%
   Read/Write detected bits into byte buffer    10%
   Read/Write bit history                        5%
   TOTAL                                        39%



Detecting a dot

The process of detecting the value of a dot (and hence
the value of a bit) given a centroid is accomplished by
examining 3 pixel values and getting the result from a
lookup table. The process is fairly simple and is
illustrated in Fig. 34. A dot 290 has a radius of 1.5
pixels. Therefore the pixel 291 that holds the centroid,
regardless of the actual position of the centroid within
that pixel, should be 100% of the dot's value. If the
centroid is exactly in the center of the pixel 291, then the
pixels above 292 and below 293 the centroid's pixel, as well
as the pixels to the left 294 and right 295 of the
centroid's pixel, will contain a majority of the dot's
value. The further a centroid is away from the exact center
of the pixel 291, the more likely that more than the center
pixel will have 100% coverage by the dot.

Although Fig. 34 only shows centroids differing to the
left and below the center, the same relationship obviously
holds for centroids above and to the right of center. In
Case 1, the centroid is exactly in the center of the middle
pixel 291. The center pixel 291 is completely covered by the
dot, and the pixels above, below, left, and right are also
well covered by the dot. In Case 2, the centroid is to the
left of the center of the middle pixel 291. The center pixel
is still completely covered by the dot, and the pixel 294 to
the left of the center is now completely covered by the dot.
The pixels above 292 and below 293 are still well covered.
In Case 3, the centroid is below the center of the middle
pixel 291. The center pixel 291 is still completely covered
by the dot, and the pixel 293 below center is now completely
covered by the dot. The pixels left 294 and right 295 of
center are still well covered. In Case 4, the centroid is
left and below the center of the middle pixel. The center
pixel 291 is still completely covered by the dot, and both
the pixel to the left of center 294 and the pixel below
center 293 are completely covered by the dot.

The algorithm for updating the centroid uses the
distance of the centroid from the center of the middle pixel
291 in order to select 3 representative pixels and thus
decide the value of the dot:

   Pixel 1: the pixel containing the centroid

   Pixel 2: the pixel to the left of Pixel 1 if the
   centroid's X coordinate (column value) is < 1/2, otherwise
   the pixel to the right of Pixel 1.

   Pixel 3: the pixel above Pixel 1 if the centroid's Y
   coordinate (row value) is < 1/2, otherwise the pixel below
   Pixel 1.

As shown in Fig. 35, the value of each pixel is output
to a precalculated lookup table 301. The 3 pixels are fed
into a 12-bit lookup table, which outputs a single bit
indicating the value of the dot - on or off. The lookup
table 301 is constructed at chip definition time, and can be
compiled into about 500 gates. The lookup table can be a
simple threshold table, with the exception that the center
pixel (Pixel 1) is weighted more heavily.
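
The pixel selection and thresholding can be sketched in
Python. The sketch is an illustration under several
assumptions that are not in the description: pixels are 4-bit
values where a higher value means more dot coverage, the
center pixel is given a weight of 2, and the threshold of 32
(just over half of the maximum weighted sum of 60) is chosen
arbitrarily. The actual contents of lookup table 301 are not
given in the text.

```python
def select_pixels(image, row, col):
    """Pick Pixel 1, 2 and 3 for a centroid at fractional (row, col).

    image is indexed as image[pixel_row][pixel_column].
    """
    r, c = int(row), int(col)
    frac_row, frac_col = row - r, col - c
    pixel1 = image[r][c]                                            # holds the centroid
    pixel2 = image[r][c - 1] if frac_col < 0.5 else image[r][c + 1] # left or right
    pixel3 = image[r - 1][c] if frac_row < 0.5 else image[r + 1][c] # above or below
    return pixel1, pixel2, pixel3


def dot_value(pixel1, pixel2, pixel3, threshold=32):
    """Weighted threshold standing in for the 12-bit lookup table."""
    return 1 if 2 * pixel1 + pixel2 + pixel3 >= threshold else 0


if __name__ == "__main__":
    window = [[0, 12, 0],
              [15, 15, 3],
              [0, 14, 0]]
    chosen = select_pixels(window, 1.2, 1.3)
    print(chosen, dot_value(*chosen))
```

Because the three inputs are 4 bits each, the function
dot_value could be tabulated over all 4096 input combinations
to form a 12-bit lookup table of the kind described.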

Step 3: Update the centroid Δs for each row in the
column

The idea of the Δs processing is to use the previous
bit history to generate a 'perfect' dot at the expected
centroid location for each row in a current column. The
actual pixels (from the CCD) are compared with the expected
'perfect' pixels. If the two match, then the actual centroid
location must be exactly in the expected position, so the
centroid Δs must be valid and not need updating. Otherwise a
process of changing the centroid Δs needs to occur in order
to best fit the expected centroid location to the actual
data. The new centroid Δs will be used for processing the
dot in the next column.

Updating the centroid Δs is done as a subsequent
process from Step 2 for the following reasons:

   to reduce complexity in design, so that it can be
   performed as Step 2 of Phase 1

   there is enough bandwidth remaining to allow it, to
   allow reuse of DRAM buffers, and

   to ensure that all the data required for centroid
   updating is available at the start of the process without
   special pipelining.

The centroid Δs are processed as Δcolumn and Δrow
respectively to reduce complexity.

Although a given dot is 3 pixels in diameter, it is
likely to occur in a 4x4 pixel area. However the edge of one
dot will as a result be in the same pixel as the edge of the
next dot. For this reason, centroid updating requires more
than simply the information about a given single dot.

Fig. 36 shows a single dot 310 from the previous column
with a given centroid 311. In this example, the dot 310
extends over 4 pixel columns 312-315 and in fact, part of
the previous dot column's dot (coordinate = (PrevColumn,
CurrentRow)) has entered the current column for the dot on
the current row. If the dot in the current row and column
was white, we would expect the rightmost pixel column 314
from the previous dot column to be a low value, since there
is only the dot information from the previous column's dot
(the current column's dot is white). From this we can see
that the higher the pixel value is in this pixel column 315,
the more the centroid should be to the right. Of course, if
the dot to the right was also black, we cannot adjust the
centroid as we cannot get information sub-pixel. The same
can be said for the dots to the left, above and below the
dot at dot coordinates (PrevColumn, CurrentRow).

From this we can say that a maximum of 5 pixel columns
and rows are required. It is possible to simplify the
situation by taking the cases of row and column centroid Δs
separately, treating them as the same problem, only rotated
90 degrees.

Taking the horizontal case first, it is necessary to
change the column centroid Δs if the expected pixels don't
match the detected pixels. From the bit history, the value
of the bits found for the CurrentRow in the current dot
column, the previous dot column, and the (previous-1)th dot
column are known. The expected centroid location is also
known. Using these two pieces of information, it is possible
to generate a 20 bit expected bit pattern should the read be
'perfect'. The 20 bit bit-pattern represents the expected Δ
values for each of the 5 pixels across the horizontal
dimension. The first nybble would represent the rightmost
pixel of the leftmost dot. The next 3 nybbles represent the
3 pixels across the center of the dot 310 from the previous
column, and the last nybble would be the leftmost pixel 317
of the rightmost dot (from the current column).

If the expected centroid is in the center of the pixel,
we would expect a 20 bit pattern based on the following
table:

   Bit history   Expected pixels
   000           00000
   001           0000D
   010           0DFD0
   011           0DFDD
   100           D0000
   101           D000D
   110           DDFD0
   111           DDFDD



The pixels to the left and right of the center dot are
either 0 or D depending on whether the bit was a 0 or 1
respectively. The center three pixels are either 000 or DFD
depending on whether the bit was a 0 or 1 respectively.
These values are based on the physical area taken by a dot
for a given pixel. Depending on the distance of the centroid
from the exact center of the pixel, we would expect data
shifted slightly, which really only affects the pixels
either side of the center pixel. Since there are 16
possibilities, it is possible to divide the distance from
the center by 16 and use that amount to shift the expected
pixels.

Once the 20 bit 5 pixel expected value has been
determined it can be compared against the actual pixels
read. This can proceed by subtracting the expected pixels
from the actual pixels read on a pixel by pixel basis, and
finally adding the differences together to obtain a distance
from the expected Δ values.

Turning to Fig. 37, there is illustrated one form of
implementation of the above algorithm which includes a look
up table 320 which receives the bit history 322 and central
fractional component 323 and outputs 324 the corresponding
20 bit number which is subtracted 321 from the central pixel
input 326 to produce a pixel difference 327.

This process is carried out for the expected centroid
and once each for a shift of the centroid left and right by
1 amount in Δcolumn. The centroid with the smallest
difference from the actual pixels is considered to be the
'winner' and the Δcolumn updated accordingly (which
hopefully is 'no change'). As a result, a Δcolumn cannot
change by more than 1 each dot column.

The process is repeated for the vertical pixels, and
Δrow is consequentially updated.
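
The comparison of expected and actual pixels can be
sketched in Python. The sketch is illustrative only and makes
several assumptions: the differences are summed as absolute
values, ties are resolved in favour of 'no change', and the
expected patterns for the shifted candidates are supplied by
the caller, because the description does not give the shifted
table contents. The function names are hypothetical.

```python
D, F = 0xD, 0xF  # nybble values used in the expected-pixel table


def expected_centered(history):
    """Expected 5 pixels for a 3-bit history, centroid at pixel center."""
    left = D if history & 0b100 else 0
    middle = [D, F, D] if history & 0b010 else [0, 0, 0]
    right = D if history & 0b001 else 0
    return [left] + middle + [right]


def difference(expected, actual):
    """Sum of per-pixel differences (absolute values are an assumption)."""
    return sum(abs(a - e) for a, e in zip(actual, expected))


def pick_delta_change(actual, candidates):
    """Choose -1, 0 or +1 given expected patterns for each candidate.

    candidates maps a change in the delta to its expected 5-pixel
    pattern; the smallest difference wins, preferring no change.
    """
    return min(candidates,
               key=lambda change: (difference(candidates[change], actual),
                                   abs(change)))


if __name__ == "__main__":
    patterns = {0: expected_centered(0b011),
                -1: [D, F, D, 0, D],   # hypothetical shifted patterns
                1: [0, 0, D, F, D]}
    print(pick_delta_change([0, 0xD, 0xF, 0xC, 0xD], patterns))
```

The same routine would be applied to the vertical pixels to
update the row delta.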

There is a large amount of scope here for parallelism.
Depending on the rate of the clock chosen for the ACP unit
31 these units can be placed in series (and thus the testing
of 3 different Δs could occur in consecutive clock cycles),
or in parallel where all 3 can be tested simultaneously. If
the clock rate is fast enough, there is less need for
parallelism.
Bandwidth utilization 

It is necessary to read the old Δ of the Δs, and to
write them out again. This takes 10% of the bandwidth:

2 * (76 * (3150/32) + 2 * 3150) = 27,648ns = 10% of bandwidth

It is necessary to read the bit history for the given
row as we update its Δs. Each byte contains 2 rows' bit
histories, thus taking 2.5% of the bandwidth:

76 * ((3150/2)/32) + 2 * (3150/2) = 4,085ns = 2.5% of bandwidth

In the worst case of pixel drift due to a 1° rotation,
centroids will shift 1 column every 57 pixel rows, but since
a dot is 3 pixels in diameter, a given pixel column will be
valid for 171 pixel rows (3 * 57). As a byte contains 2
pixels, the number of bytes valid in cached reads will be a
worst case of 86 (out of 128 read). The worst case timing
for 5 columns is therefore 31% bandwidth:

5 * (((9450/(128 * 2)) * 320) * 128/86) = 88,112ns = 31% of
bandwidth
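
The centroid read/write figure and the 5 column figure above
can be reproduced with a few lines of arithmetic. The sketch
below assumes that each division counts whole cache lines and
is therefore rounded up; that rounding is inferred from the
printed results and is not stated in the text. The function
and parameter names are illustrative only.

```python
import math


def delta_rw_ns(entries=3150, line=32, miss_ns=76, per_entry_ns=2):
    # Read and then write every centroid delta: 2 passes over the array.
    lines = math.ceil(entries / line)      # whole cache lines (assumed)
    return 2 * (miss_ns * lines + per_entry_ns * entries)


def pixel_columns_ns(columns=5, pixels=9450, line_bytes=128,
                     line_ns=320, valid_bytes=86):
    # Worst-case read of 5 pixel columns: 2 pixels per byte, and only
    # 86 of each 128 bytes fetched remain valid under worst-case drift.
    lines = math.ceil(pixels / (line_bytes * 2))
    return columns * lines * line_ns * line_bytes / valid_bytes
```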

The total bandwidth required for updating the
centroid Δ is summarized in the following table:

Read/Write centroid Δ            10%
Read bit history                 2.5%
Read 5 columns of pixel data     31%
TOTAL                            43.5%



Summary of Bandwidth for Phase 2 

The total bandwidth required for Phase 2 is
summarized in the following table:

Step 0      0%
Step 1      0.5%
Step 2      39%
Step 3      43.5%
TOTAL       83%



The reading of the pixel data from the CCD occurs at
the same time, and uses 13% of available bandwidth. This
combines for a total of 96%.

Memory usage for Phase 2:

The 2MB bit-image DRAM area is read from and written to
during Phase 2 processing. The 2MB pixel-data DRAM area is
read.

The 0.5MB scratch DRAM area is used for storing row
data, namely:

Centroid array       24 bits (16:8) * 2 * 3150 = 18,900 bytes
Bit History array    3 bits * 3150 entries (2 per byte) = 1575 bytes



Phase 3 - Unscramble and XOR the raw data
Returning to Fig. 28, the next step in decoding is to
unscramble and XOR the raw data. The 2MB byte image, as
taken from the Artcard, is in a scrambled XORed form. It
must be unscrambled and re-XORed to retrieve the bit image
necessary for the Reed Solomon decoder in phase 4.

Turning to Fig. 38, the unscrambling process 330 takes a
2MB scrambled byte image 331 and writes an unscrambled 2MB
image 332. The process cannot reasonably be performed
in-place, so 2 sets of 2MB areas are utilised. The scrambled
data 331 is in symbol block order arranged in a 16x16 array,
with symbol block 0 (334) having all the symbol 0's from all
the code words in random order. Symbol block 1 has all the
symbol 1's from all the code words in random order etc.
Since there are only 255 symbols, the 256th symbol block is
currently unused.

A linear feedback shift register is used to determine
the relationship between the position within a symbol block
eg. 334 and what code word eg. 355 it came from. This works
as long as the same seed is used when generating the
original Artcard images. The XOR of bytes from alternative
source lines with 0xAA and 0x55 respectively is effectively
free (in time) since the bottleneck of time is waiting for
the DRAM to be ready to read/write to non-sequential
addresses.

The timing of the unscrambling XOR process is
effectively 2MB of random byte-reads, and 2MB of random
byte-writes i.e. 2 * (2MB * 76ns + 2MB * 2ns) =
327,155,712ns or approximately 0.33 seconds. This timing
assumes no caching.
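
The timing figure and the alternating XOR masks can be checked
with a short sketch. The function names are illustrative, and
which line parity receives which mask is an assumption; the
text only says that alternate source lines use 0xAA and 0x55.
The linear feedback shift register itself is not modelled,
since its polynomial and seed are not given here.

```python
MB = 1024 * 1024


def unscramble_time_ns(size_bytes=2 * MB, random_access_ns=76, transfer_ns=2):
    # One random byte-read plus one random byte-write per byte of image.
    return 2 * (size_bytes * random_access_ns + size_bytes * transfer_ns)


def xor_line(line, index):
    # Alternate source lines are XORed with 0xAA and 0x55. The choice of
    # 0xAA for even-numbered lines is an assumption made for this sketch.
    mask = 0xAA if index % 2 == 0 else 0x55
    return bytes(b ^ mask for b in line)
```

Applying xor_line twice to the same line restores the original
bytes, which is why the same masks serve for encoding and
decoding.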
Phase 4 - Reed Solomon decode
This phase is a loop, iterating through copies of the
data in the bit image, passing them to the Reed-Solomon
decode module until either a successful decode is made or
until there are no more copies to attempt decode from.

The Reed-Solomon decoder used is a core such as LSI
Logic's L64712.

The L64712 has a throughput of 50Mbits per second
(around 6.25MB per second), so the time is bound by the
speed of the Reed-Solomon decoder rather than the 2MB read
and 1MB write memory access time (500MB/sec for sequential
accesses). The time taken in the worst case is thus 2/6.25s
= approximately 0.32 seconds.
Phase 5 - Running the Vark script

The overall time taken to read the Artcard 9 and decode
it is therefore approximately 2.15 seconds. The apparent
delay to the user is actually only 0.65 seconds (the total
of Phases 3 and 4), since the Artcard stops moving after 1.5
seconds.

Once the Artcard is loaded, the Artvark script must be
interpreted. Rather than run the script immediately, the
script is only run upon the pressing of the 'Print' button
13 (Fig. 1). Time taken to run the script will vary
depending on the complexity of the script, and must be taken
into account for the perceived delay between pressing the
print button and the actual printing.

Vark Accelerator 79

The Vark Accelerator (VA) 79 (Fig. 3) is a digital
processing system that accelerates computationally expensive
Vark functions. The balance of functions performed in
software by the CPU core 72, and in hardware by the Vark
accelerator 79, is implementation dependent. The goal
of the VA 79 is to assist all Artcard styles to execute in a
time that does not seem too slow to the user. As CPUs become
faster and more powerful, the number of functions requiring
hardware acceleration becomes less and less. The ACP has a
microcoded ALU sub-system that allows general hardware
speedup of the following time-critical functions.

1) Image access mechanisms for general software processing

2) Image convolver

3) Data driven image warper

4) Image scaling

5) Image tessellation

12) Histogram collector

13) CCD image to internal image conversion

14) Construction of image pyramids (used by warper & for
brushing)

The following table summarizes the time taken for each
Vark operation if implemented in the ALU model. The method
of implementing the function using the ALU model is
described hereinafter.



Operation                  Speed of Operation              1500 * 1000 image
                                                           1 channel       3 channels

Image composite            1 cycle per output pixel        0.015 s         0.045 s
Image convolve             k/3 cycles per output pixel
                           (k = kernel size)
                           3x3 convolve                    0.045 s         0.135 s
                           5x5 convolve                    0.125 s         0.375 s
                           7x7 convolve                    0.245 s         0.735 s
Image warp                 8 cycles per pixel              0.120 s         0.360 s
Histogram collect          2 cycles per pixel              0.030 s         0.090 s
Image Tessellate           1/3 cycle per pixel             0.005 s         0.015 s
Image sub-pixel Translate  1 cycle per output pixel
Color lookup replace       1/2 cycle per pixel             0.008 s         0.023 s
Color space transform      8 cycles per pixel              0.120 s         0.360 s
Convert CCD image to       4 cycles per output pixel       0.06 s          0.18 s
internal image (including
color convert & scale)
Construct image pyramid    1 cycle per input pixel         0.015 s         0.045 s
Scale                      Maximum of:                     0.015 s         0.045 s
                           2 cycles per input pixel        (minimum)       (minimum)
                           2 cycles per output pixel
                           2 cycles per output pixel
                           (scaled in X only)
Affine transform           2 cycles per output pixel       0.03 s          0.09 s
Brush rotate/translate
and composite
Tile Image                 4-8 cycles per output pixel     0.015 s to      0.060 s to
                                                           0.030 s         [illegible]
                                                                           for 4 channels
                                                                           (Lab, texture)

Illuminate image           Cycles per pixel
  Ambient only             1/2                             0.008 s         0.023 s
  Directional light        1                               0.015 s         0.045 s
  Directional (bm)         [illegible]                     0.09 s          0.27 s
  Omni light               6                               0.09 s          0.27 s
  Omni (bm)                [illegible]                     0.137 s         0.41 s
  Spotlight                [illegible]                     0.137 s         0.41 s
  Spotlight (bm)           [illegible]                     0.18 s          0.54 s
(bm) = bumpmap

For example, to convert a CCD image, collect histogram
& perform lookup-colour replacement (for image enhancement)
takes: 9+2+0.5 cycles per pixel, or 11.5 cycles. For a 1500
x 1000 image that is 172,500,000ns, or approximately 0.2
seconds per component, or 0.6 seconds for all 3 components.
Add a simple warp, and the total comes to 0.6 + 0.36, almost
1 second.
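
The timings in the table follow from multiplying the cycles
per pixel by the pixel count and the 10ns cycle time (100MHz).
A minimal sketch of that arithmetic, with an illustrative
function name:

```python
CYCLE_NS = 10                 # 100MHz clock, 10ns per cycle
WIDTH, HEIGHT = 1500, 1000    # image size used throughout the table


def operation_seconds(cycles_per_pixel, channels=1):
    """Time to run an operation over a full 1500 x 1000 image."""
    ns = cycles_per_pixel * WIDTH * HEIGHT * CYCLE_NS * channels
    return ns / 1e9
```

For instance, operation_seconds(8, channels=3) corresponds to
the 3 channel image warp entry, and operation_seconds(11.5) to
the per-component figure in the example above.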
Image Convolver

A convolve is a weighted average around a center pixel.
The average may be a simple sum, a sum of absolute values,
the absolute value of a sum, or sums truncated at 0.

The image convolver is a general-purpose convolver,
allowing a variety of functions to be implemented by varying
the values within a variable-sized coefficient kernel. The
kernel sizes supported are 3x3, 5x5 and 7x7 only.

Turning now to Fig. 39, there is illustrated 340 an
example of the convolution process. The pixel component
values fed into the convolver process 341 come from a Box
Read Iterator 342. The Iterator 342 provides the image data
row by row, and within each row, pixel by pixel. The output
from the convolver 341 is sent to a Sequential Write
Iterator 344, which stores the resultant image in a valid
image format.

A Coefficient Kernel 346 is a lookup table in DRAM. The
kernel is arranged with coefficients in the same order as
the Box Read Iterator 342. Each coefficient entry is 8 bits.
A simple Sequential Read Iterator can be used to index into
the kernel 346 and thus provide the coefficients. It
simulates an image with ImageWidth equal to the kernel size,
and a Loop option is set so that the kernel would
continuously be provided.
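
In software terms, the convolve amounts to one multiply/add
per kernel entry for every output pixel. The sketch below
shows only the 'simple sum' mode on a single channel. The
function name and the edge policy (repeating the nearest edge
pixel) are assumptions, since edge handling is not described
here.

```python
def convolve(image, kernel):
    """Convolve a single-channel image (list of rows) with a square
    kernel of odd size (3, 5 or 7), using the 'simple sum' mode.

    Pixels outside the image are taken from the nearest edge; this
    edge policy is an assumption made for the sketch.
    """
    k = len(kernel)
    r = k // 2
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            # One multiply/add per kernel entry.
            for ky in range(k):
                for kx in range(k):
                    sy = min(max(y + ky - r, 0), h - 1)
                    sx = min(max(x + kx - r, 0), w - 1)
                    acc += kernel[ky][kx] * image[sy][sx]
            out[y][x] = acc
    return out
```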

One form of implementation of the convolve process is as
illustrated in Fig. 40. The following constants are set by
software:

Constant    Value
K1          Kernel size (9, 25, or 49)

The control logic is used to count down the number of
multiply/adds per pixel. When the count (accumulated in
Latch2) reaches 0, the control signal generated is used to
write out the current convolve value (from Latch1) and to
reset the count. In this way, one control logic block can be
used for a number of parallel convolve streams.

With 3 parallel streams the requirements are summarized as
follows:

Requirements                        *+   +    K    LU   Iterators
General (convolve kernel)           0    0    0    0    1
General (per convolve stream) 1     1    0    1    0    2
General (per convolve stream) 2     1    0    1    0    2
General (per convolve stream) 3     1    0    1    0    2
Control logic (one set required)    0    1    2    0    0
TOTAL                               3    1    5    0    7



Each cycle the multiply ALU can perform one
multiply/add to incorporate the appropriate part of a pixel.
The number of cycles taken to sum up all the values is
therefore the number of entries in the kernel. Since this is
compute bound, it is appropriate to divide the image into
multiple sections and process them in parallel.

On a 7x7 kernel, the time taken for each pixel is 49
cycles, or 490ns. Since each cache line holds 32 pixels, the
time available for memory access is 12,740ns ((32-7+1) x
490ns). The time taken to read 7 cache lines and write 1 is
worst case 1,120ns (8 * 140ns, all accesses to same DRAM
bank). Consequently it is possible to process up to 10
pixels in parallel given unlimited resources. Given a
limited number of ALUs it is possible to do at best 4 in
parallel. The time taken to perform the convolution using a
7x7 kernel is therefore 0.18375 seconds (1500*1000
* 490ns / 4 = 183,750,000ns).

On a 5x5 kernel, the time taken for each pixel is 25
cycles, or 250ns. Since each cache line holds 32 pixels, the
time available for memory access is 7,000ns ((32-5+1) x
250ns). The time taken to read 5 cache lines and write 1 is
worst case 840ns (6 * 140ns, all accesses to same DRAM
bank). Consequently it is possible to process up to 7 pixels
in parallel given unlimited resources. Given a limited
number of ALUs it is possible to do at best 4. The time
taken to perform the convolution using a 5x5 kernel is
therefore 0.09375 seconds (1500*1000 * 250ns / 4 =
93,750,000ns).

On a 3x3 kernel, the time taken for each pixel is 9
cycles, or 90ns. Since each cache line holds 32 pixels, the
time available for memory access is 2,700ns ((32-3+1) x
90ns). The time taken to read 3 cache lines and write 1 is
worst case 560ns (4 * 140ns, all accesses to same DRAM
bank). Consequently it is possible to process up to 4 pixels
in parallel given unlimited resources. Given a limited
number of ALUs and Read/Write Iterators it is possible to do
at best 4. The time taken to perform the convolution using a
3x3 kernel is therefore 0.03375 seconds (1500*1000
* 90ns / 4 = 33,750,000ns).
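
The three cases above apply the same arithmetic to different
kernel sizes. A minimal sketch, with illustrative names, that
reproduces the per-pixel time, the window available for memory
access, the worst-case memory cost and the whole-image time
with 4 pixels processed in parallel:

```python
CYCLE_NS = 10          # 100MHz
LINE_PIXELS = 32       # pixels per cache line
LINE_ACCESS_NS = 140   # worst-case cache line access, same DRAM bank
PARALLEL = 4           # pixels processed in parallel, limited by ALUs


def convolve_budget(n, width=1500, height=1000):
    """Return (per-pixel ns, memory window ns, memory cost ns,
    whole-image seconds) for an n x n kernel."""
    pixel_ns = n * n * CYCLE_NS                    # one multiply/add per entry
    window_ns = (LINE_PIXELS - n + 1) * pixel_ns   # time available for memory
    memory_ns = (n + 1) * LINE_ACCESS_NS           # read n lines, write 1
    total_s = width * height * pixel_ns / PARALLEL / 1e9
    return pixel_ns, window_ns, memory_ns, total_s
```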

Consequently each output pixel takes kernelsize/3 cycles to
compute. The actual timings are summarized in the following
table:

Kernel size   Time taken to            Time to process   Time to process
              calculate output pixel   1 channel at      3 channels at
                                       1500x1000         1500x1000
3x3 (9)       3 cycles                 0.045 seconds     0.135 seconds
5x5 (25)      8 1/3 cycles             0.125 seconds     0.375 seconds
7x7 (49)      16 1/3 cycles            0.245 seconds     0.735 seconds



Image Compositor

Compositing is to add a foreground image to a
background image using a matte or α channel to govern the
appropriate proportions of background and foreground in the
final image. Two styles of compositing are preferably
supported: regular compositing and associated compositing.
The rules for the two styles are:

Regular composite: new value = Foreground +
(Background - Foreground) α

Associated composite: new value = Foreground + (1 -
α) Background

The difference then, is that with associated
compositing, the foreground has been pre-multiplied with the
matte, while in regular compositing it has not. An example
of the compositing process is as illustrated in Fig. 41.

The α channel has values from 0 to 255 corresponding to
the range 0 to 1.

Regular Composite

A regular composite is implemented as:
Foreground + (Background - Foreground) * α/255
The division X/255 is approximated by 257X/65536.

An implementation of the compositing process is shown in
more detail in Fig. 42, where the following constant is set
by software:

Constant    Value
K1          257

Since 4 Iterators are required, the composite process
takes 1 cycle per pixel, with a utilization of only half of
the ALUs. The composite process is only run on a single
channel. To composite a 3-channel image with another, the
compositor must be run 3 times, once for each channel.

The time taken to composite a full size single channel
is 0.015s (1500 * 1000 * 1 * 10ns), or 0.045s to composite
all 3 channels.

To approximate a divide by 255 it is possible to
multiply by 257 and then divide by 65536. It can also be
achieved by a single add (256 * x + x) and ignoring (except
for rounding purposes) the final 16 bits of the result.
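
The two compositing rules and the divide-by-255 approximation
can be sketched as follows. The function names are
illustrative, and the rounding constant (adding 32768 before
discarding the low 16 bits) is an assumption about how "for
rounding purposes" is realised.

```python
def div255(x):
    """Approximate x / 255 as (257 * x) / 65536 using one add and a
    shift. Adding 32768 before the shift rounds the result; the
    exact rounding used by the hardware is an assumption."""
    return ((x << 8) + x + 32768) >> 16


def regular_composite(fg, bg, alpha):
    # Foreground + (Background - Foreground) * alpha / 255
    return fg + div255((bg - fg) * alpha)


def associated_composite(fg, bg, alpha):
    # Foreground is already multiplied by the matte:
    # Foreground + (1 - alpha) * Background, with alpha in 0..255
    return fg + div255((255 - alpha) * bg)
```

Over the full range of 8 bit products the approximation stays
within one unit of the exact quotient, which is why the
multiply by 257 can replace a true divide.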

As shown in Fig. 41, the compositor process requires 3
Sequential Read Iterators 351-353 and 1 Sequential Write
Iterator 355, and is implemented as microcode using 1 Adder
ALU in conjunction with a multiplier ALU. Composite time is
1 cycle (10ns) per pixel. Different microcode is required
for associated and regular compositing, although the average
time per pixel composite is the same.

The composite process is only run on a single channel.
To composite one 3-channel image with another, the
compositor must be run 3 times, once for each channel. As
the α channel is the same for each composite, it must be
read each time. However it should be noted that to transfer
(read or write) 4 x 32 byte cache-lines in the best case
takes 320ns. The pipeline gives an average of 1 cycle per
pixel composite, taking 32 cycles or 320ns (at 100MHz) to
composite the 32 pixels, so the α channel is effectively
read for free. An entire channel can therefore be composited
in:

1500/32 * 1000 * 320ns = 15,040,000ns = 0.015 seconds.

The time taken to composite a full size 3 channel image
is therefore 0.045 seconds.

Construct Image Pyramid

Several functions, such as warping, tiling and
brushing, require the average value of a given area of
pixels. Rather than calculate the value for each area given,
these functions preferably make use of an image pyramid. As
illustrated in Fig. 42, an image pyramid 360 is effectively a
multi-resolution pixelmap. The original image is a 1:1
representation. Sub-sampling by 2:1 in each dimension
produces an image 1/4 the original size. This process
continues until the entire image is represented by a single
pixel.

An image pyramid is constructed from an original image,
and consumes 1/3 of the size taken up by the original image
(1/4 + 1/16 + 1/64 + ...). For an original image of 1500 x
1000 the corresponding image pyramid is approximately 1/2 MB.

The image pyramid is constructed via a 3x3 convolve
performed on 1 in 4 input image pixels, advancing the center
of the convolve kernel by 2 pixels each dimension. A 3x3
convolve results in higher accuracy than simply averaging 4
pixels, and has the added advantage that coordinates on
different pyramid levels differ only by shifting 1 bit per
level.

The construction of an entire pyramid relies on a
software loop that calls the pyramid level construction
function once for each level of the pyramid.

The timing to produce 1 level of the pyramid is 9/4 *
1/4 of the resolution of the input image since we are
generating an image 1/4 of the size of the original. Thus
for a 1500 x 1000 image:

Timing to produce level 1 of pyramid = 9/4 * 750 * 500
= 843,750 cycles

Timing to produce level 2 of pyramid = 9/4 * 375 * 250
= 210,938 cycles

Timing to produce level 3 of pyramid = 9/4 * 188 * 125
= 52,735 cycles

Etc.

The total time is 3/4 cycle per original image pixel
(image pyramid is 1/3 of original image size, and each pixel
takes 9/4 cycles to be calculated, i.e. 1/3 * 9/4 = 3/4). In
the case of a 1500 x 1000 image this is 1,125,000 cycles (at
100MHz), or 0.011 seconds. This timing is for a single
colour channel; 3 colour channels require 0.034 seconds
processing time.
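
The level sizes and the 3/4 cycle per pixel estimate can be
reproduced with exact fractions. In this sketch the function
names are illustrative, and rounding odd dimensions up when
halving is an assumption (it matches the 188 x 125 size quoted
for level 3).

```python
from fractions import Fraction


def pyramid_levels(width, height):
    """Sizes of successive pyramid levels, halving each dimension
    (rounded up, an assumption) until a single pixel remains."""
    levels = []
    while width > 1 or height > 1:
        width, height = (width + 1) // 2, (height + 1) // 2
        levels.append((width, height))
    return levels


def level_cycles(width, height):
    # 9/4 cycles per output pixel of the level, as stated in the text.
    return Fraction(9, 4) * width * height


def estimated_total_cycles(width, height):
    # The pyramid holds about 1/3 as many pixels as the original image,
    # so the total is about 1/3 * 9/4 = 3/4 cycle per original pixel.
    return Fraction(1, 3) * Fraction(9, 4) * width * height
```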

General Data Driven Image Warper

The ACP 31 is able to carry out image warping
manipulations of the input image. The principles of image
warping are well-known in theory. One thorough text book
reference on the process of warping is "Digital Image
Warping" by George Wolberg published in 1990 by the IEEE
Computer Society Press, Los Alamitos, California. The
warping process utilises a warp map which forms part of the
data fed in via Artcard 9. The warp map can be arbitrarily
dimensioned in accordance with requirements and provides
information of a mapping of input pixels to output pixels.
Unfortunately, the utilisation of arbitrarily sized warp
maps presents a number of problems which must be solved by
the image warper.

Turning to Fig. 43, a warp map 365, having dimensions
AxB comprises array values of a certain magnitude (for
example 8 bit values from 0 - 255) which set out the
coordinate of a theoretical input image which maps to the
corresponding "theoretical" output image having the same
array coordinate indices. Unfortunately, any output image
eg. 366 will have its own dimensions CxD which may further
be totally different from an input image which may have its
own dimensions ExF. Hence, it is necessary to facilitate the
remapping of the warp map 365 so that it can be utilised for
output image 366 to determine, for each output pixel, the
corresponding area or region of the input image 367 from
which the output pixel colour data is to be constructed.
Hence, for each output pixel in output image 366 it is
necessary to first determine a corresponding warp map value
from warp map 365. This may include the need to
bi-linearly interpolate the surrounding warp map values when
an output image pixel maps to a fractional position within
warp map table 365. The result of this process will give the
location of an input image pixel in a "theoretical" image
which will be dimensioned by the size of each data value
within the warp map 365. These values must be rescaled so
as to map the theoretical image to the corresponding actual
input image 367.

In order to determine the actual value an output image
pixel should take so as to avoid aliasing effects, adjacent
output image pixels should be examined to determine a region
of input image pixels 367 which will contribute to the final
output image pixel value. In this respect, the image
pyramid is utilised as will become more apparent
hereinafter.

The image warper performs several tasks in order to
warp an image:

Scale the warp map to match the output image size.

Determine the span of the region of input image pixels
represented in each output pixel.

Calculate the final output pixel value via tri-linear
interpolation from the input image pyramid.

Scale warp map

As noted previously, in a data driven warp, there is
the need for a warp map that describes, for each output
pixel, the center of a corresponding input image map.
Instead of having a single warp map as previously described,
containing interleaved x and y value information, it is
possible to treat the X and Y coordinates as separate
channels.

Consequently, preferably there are two warp maps: an X
warp map showing the warping of X coordinates, and a Y warp
map, showing the warping of the Y coordinate. As noted
previously, the warp map 365 can have a different spatial
resolution than the image being scaled (for example a
32 x 32 warp-map 365 may adequately describe a warp for a
1500 x 1000 image 366). In addition, the warp maps can be
represented by 8 or 16 bit values that correspond to the
size of the image being warped.

There are several steps involved in producing points in
the input image space from a given warp map:

1. Determining the corresponding position in the warp map
for the output pixel
2. Fetching the values from the warp map for the next step
(this can require scaling in the resolution domain if the
warp map is only 8 bit values)
3. Bi-linear interpolation of the warp map to determine the
actual value
4. Scaling the value to correspond to the input image domain

The first step can be accomplished by multiplying the
current X/Y coordinate in the output image by a scale factor
(which can be different in X & Y). For example, if the
output image was 1500 x 1000, and the warp map was 150 x
100, we scale both X & Y by 1/10.

Fetching the values from the warp map requires access
to 2 Lookup tables. One Lookup table indexes into the X
warp-map, and the other indexes into the Y warp-map. The
lookup table either reads 8 or 16 bit entries from the
lookup table, but always returns 16 bit values (clearing the
high 8 bits if the original values are only 8 bits).

The next step in the pipeline is to bi-linearly
interpolate the looked-up warp map values.

Finally the result from the bi-linear interpolation is
scaled to place it in the same domain as the image to be
warped. Thus, if the warp map range was 0-255, we scale X by
1500/255, and Y by 1000/255.
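
The four steps can be sketched in software as follows. This
is an illustration under stated assumptions rather than the
pipeline of Fig. 44: the function names are hypothetical,
floating point stands in for the fixed point hardware
arithmetic, and reads past the edge of the warp map are
clamped, which is not described in the text.

```python
def bilinear(grid, u, v):
    """Bi-linearly interpolate grid (list of rows) at fractional
    position (u, v). Reads are clamped at the map edge, which is an
    assumption; edge handling is not described in the source."""
    h, w = len(grid), len(grid[0])
    x0, y0 = min(int(u), w - 1), min(int(v), h - 1)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = u - x0, v - y0
    top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
    bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def warp_point(x_out, y_out, warp_x, warp_y, out_size, in_size,
               map_range=255):
    """Map an output pixel to a point in the input image."""
    out_w, out_h = out_size
    in_w, in_h = in_size
    # Step 1: position of the output pixel within the warp map.
    u = x_out * len(warp_x[0]) / out_w
    v = y_out * len(warp_x) / out_h
    # Steps 2 and 3: fetch the 4 neighbours and interpolate.
    wx = bilinear(warp_x, u, v)
    wy = bilinear(warp_y, u, v)
    # Step 4: rescale from the warp map range to the input image.
    return wx * in_w / map_range, wy * in_h / map_range
```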



The interpolation process is as illustrated in Fig. 44 with
the following constants set by software:

Constant    Value
K1          Xscale (scales 0-ImageWidth to 0-WarpmapWidth)
K2          Yscale (scales 0-ImageHeight to 0-WarpmapHeight)
K3          XrangeScale (scales warpmap range (eg 0-255) to
            0-ImageWidth)
K4          YrangeScale (scales warpmap range (eg 0-255) to
            0-ImageHeight)

The following lookup table is used:

Lookup         Size               Details
LU1 and LU2    WarpmapWidth x     Warpmap lookup.
               WarpmapHeight      Given [X,Y] the 4 entries required
                                  for bi-linear interpolation are
                                  returned. Even if entries are only 8
                                  bit, they are returned as 16 bit
                                  (high 8 bits 0).
                                  Transfer time is 4 entries at 2
                                  bytes per entry.
                                  Total time is 8 cycles as 2 lookups
                                  are used.


Span calculation

The points from the warp map 365 locate centers of
pixel regions in the input image 367. The distance between
input image pixels of adjacent output image pixels will
indicate the size of the regions, and this distance can be
approximated via a span calculation.

Turning to Fig. 45, for a given current point in the
warp map P1, the previous point on the same line is called
P0, and the previous line's point at the same position is
called P2. We determine the absolute distance in X & Y
between P1 and P0, and between P1 and P2. The maximum
distance in X or Y becomes the span which will be a square
approximation of the actual shape.
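
Expressed in software, the span is the largest of the four
absolute coordinate differences. A minimal sketch, with an
illustrative function name and points given as (x, y) pairs:

```python
def span(p1, p0, p2):
    """Square approximation of the input region for output point p1.

    p0 is the previous point on the same line and p2 the point at
    the same position on the previous line; each is an (x, y) pair
    already scaled into input image coordinates.
    """
    (x1, y1), (x0, y0), (x2, y2) = p1, p0, p2
    return max(abs(x1 - x2), abs(x1 - x0), abs(y1 - y2), abs(y1 - y0))
```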

Preferably, the points are processed in a vertical
strip output order. P0 is the previous point on the same
line within a strip, and when P1 is the first point on a
line within a strip, then P0 refers to the last point in the
previous strip's corresponding line. P2 is the previous
line's point in the same strip, so it can be kept in a
32-entry history buffer. The basics of the calculate span
process are as illustrated in Fig. 46 with the details of
the process as illustrated in Fig. 47.

The following DRAM FIFO is used:

Lookup   Size                  Details
FIFO1    8 ImageWidth bytes    P2 history/lookup (both X & Y
         [ImageWidth x 2       in same FIFO).
         entries at 32 bits    P1 is put into the FIFO and
         per entry]            taken out again at the same
                               pixel on the following row as
                               P2.
                               Transfer time is 4 cycles
                               (2 x 32 bits, with 1 cycle per
                               16 bits).

Since a 32 bit precision span history is kept, in the
case of a 1500 pixel wide image being warped 12,000 bytes
temporary storage is required.

Calculation of the span 364 uses 2 Adder ALUs (1 for
span calculation, 1 for looping and counting for P0 and P2
histories) and takes 7 cycles as follows:



Cycle   Action
1       A = ABS(P1x - P2x)
        Store P1x in P2x history
2       B = ABS(P1x - P0x)
        Store P1x in P0x history
3       A = MAX(A, B)
4       B = ABS(P1y - P2y)
        Store P1y in P2y history
5       A = MAX(A, B)
6       B = ABS(P1y - P0y)
        Store P1y in P0y history
7       A = MAX(A, B)

The history buffers 365, 366 are cached DRAM. The
'Previous Line' (for P2 history) buffer 366 is 32 entries of
span-precision. The 'Previous Point' (for P0 history)
buffer 365 requires 1 register that is used most of the time
(for calculation of points 1 to 31 of a line in a strip),
and a DRAM buffered set of history values to be used in the
calculation of point 0 in a strip's line.

32 bit precision in span history requires 4 cache lines
to hold P2 history, and 2 for P0 history. P0's history is
only written and read out once every 8 lines of 32 pixels to
a temporary storage space of (ImageHeight*4) bytes. Thus a
1500 pixel high image being warped requires 6000 bytes
temporary storage, and a total of 6 cache lines.
Tri-linear interpolation

Having determined the center and span of the area from
the input image to be averaged, the final part of the warp
process is to determine the value of the output pixel. Since
a single output pixel could theoretically be represented by
the entire input image, it is potentially too time-consuming
to actually read and average the specific area of the input
image contributing to the output pixel. Instead, it is
possible to approximate the pixel value by using an image
pyramid of the input image.

If the span is 1 or less, it is necessary only to read
the original image's pixels around the given coordinate, and
perform bi-linear interpolation. If the span is greater than
1, we must read two appropriate levels of the image pyramid
and perform tri-linear interpolation. Performing linear
interpolation between two levels of the image pyramid is not
strictly correct, but gives acceptable results (it errs on
the side of blurring the resultant image).

Turning to Fig. 48, generally speaking, for a given span
's', it is necessary to read image pyramid levels given by
ln2s 370 and ln2s+1 371. ln2s is simply decoding the highest
set bit of s. We must bi-linear interpolate to determine the
value for the pixel value on each of the two levels 370, 371
of the pyramid, and then interpolate between the levels.

As shown in Fig. 49, it is necessary to first
interpolate in X and Y for each pyramid level before
interpolating between the pyramid levels to obtain a final
output value 373.
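
The level selection and the final blend can be sketched as
follows. The sample callback is hypothetical (it stands in for
the bi-linear read from one pyramid level), and the blend
weight between the two levels is an assumption, since the text
does not give a formula for it.

```python
def pyramid_level(s):
    """Lower of the two pyramid levels for integer span s >= 1:
    the index of the highest set bit of s."""
    return s.bit_length() - 1


def lerp(a, b, t):
    return a + (b - a) * t


def trilinear(sample, x, y, s):
    """Blend bi-linear samples from two adjacent pyramid levels.

    sample(level, x, y) is a hypothetical callback that returns the
    bi-linearly interpolated value at (x, y) on the given level.
    The blend weight below (how far s lies between 2**level and
    2**(level + 1)) is an assumption, not taken from the source.
    """
    level = pyramid_level(s)
    lower = sample(level, x, y)
    upper = sample(level + 1, x, y)
    t = (s - (1 << level)) / (1 << level)
    return lerp(lower, upper, t)
```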

The image pyramid address mode is used to generate
addresses for pixel coordinates at (x, y) on pyramid level s
& s+1. Each level of the image pyramid contains pixels
sequential in x. Hence, reads in x are likely to be cache
hits.

Reasonable cache coherence can be obtained as local
regions in the output image are typically locally coherent
in the input image (perhaps at a different scale however,
but coherent within the scale). Since it is not possible to
know the relationship between the input and output images,
we ensure that output pixels are written in a vertical strip
(via a Vertical-Strip Iterator) in order to best make use of
cache coherence.

Tri-linear interpolation can be completed in as few as
2 cycles on average using all 4 multiply ALUs and all 4
adder ALUs as a pipeline and assuming no memory access
required. But since all the interpolation values are derived
from the image pyramids, interpolation speed is completely
dependent on cache coherence (not to mention that other
units are busy doing warp-map scaling and span
calculations). As many cache lines as possible should
therefore be available to the image-pyramid reading. The
best speed will be 8 cycles, using 2 Multiply ALUs (see the
chapter on ALUs for a discussion on different algorithms for
tri-linear interpolation).

The output pixels are written out to the DRAM via a
Vertical-Strip Write Iterator that uses 2 cache lines. The
speed is therefore limited to a minimum of 8 cycles per
output pixel. If the scaling of the warp map requires 8 or
fewer cycles, then the overall speed will be unchanged.
Otherwise the throughput is the time taken to scale the warp
map. In most cases the warp map will be scaled up to match
the size of the photo.

Assuming a warp map that requires 8 or fewer cycles per
pixel to scale, the time taken to convert a single colour
component of image is therefore 0.12s (1500 * 1000 * 8
cycles * 10ns per cycle).
Histogram Collector 

The histogram collector is a microcode program that
takes an image channel as input, and produces a histogram as
output. Each of a channel's pixels has a value in the range
0-255. Consequently there are 256 entries in the histogram
table, each entry 32 bits - large enough to contain a count
of an entire 1500x1000 image.

As shown in Fig. 50, since the histogram represents a
summary of the entire image, a Sequential Read Iterator 378
is sufficient for the input. The histogram itself can be
completely cached, requiring 32 cache lines (1K).

The microcode has two passes: an initialization pass
which sets all the counts to zero, and then a "count" stage
that increments the appropriate counter for each pixel read
from the image.

The first stage requires the Address Unit and a single 
Adder ALU, with the address of the histogram table 377 for 
initializing. 

Relative     Address Unit                       Adder Unit 1
Microcode    (A = Base address of histogram)
Address
---------    -------------------------------    ------------------
0            Write 0 to                         Out1 = A
             A + (Adder1.Out1 << 2)             A = A - 1
                                                BNZ 0
1            Rest of processing                 Rest of processing


The second stage processes the actual pixels from the 
image, and uses 4 Adder ALUs: 
Address  Adder 1     Adder 2                  Adder 3       Adder 4          Address Unit
-------  ----------  -----------------------  ------------  ---------------  -------------------------
1        A = 0                                              A = -1
2        Out1 = A    A = Adder1.Out1          A = Adr.Out1  A = A + 1        Out1 = Read 4 bytes from:
(BZ 2)   A = pixel   Z = pixel - Adder1.Out1                                 (A + (Adder1.Out1 << 2))
3                    Out1 = A                 Out1 = A      Out1 = A         Write Adder4.Out1 to:
                                                            A = Adder3.Out1  (A + (Adder2.Out << 2))
4                                                                            Write Adder4.Out1 to:
                                                                             (A + (Adder2.Out << 2))
                                                                             Flush caches


The Zero flag from Adder2 cycle 2 is used to stay at
microcode address 2 for as long as the input pixel is the
same. When it changes, the new count is written out in
microcode address 3, and processing resumes at microcode
address 2. Microcode address 4 is used at the end, when
there are no more pixels to be read.
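
The two microcode stages can be sketched in software as follows. This is only an illustration of the run-based counting described above, not the microcode itself; the function name and the use of a Python list for the 256-entry table are assumptions made for the example.

```python
def collect_histogram(pixels):
    """Histogram of 8-bit pixel values using run-based counter updates."""
    # Stage 1: initialise all 256 counters to zero.
    table = [0] * 256

    # Stage 2: keep a working count while the pixel value stays the same,
    # and write it back only when the value changes or the input ends.
    current = None  # pixel value of the run being counted
    count = 0       # working count for that value
    for pixel in pixels:
        if pixel == current:
            count += 1              # same value: stay at "address 2"
            continue
        if current is not None:
            table[current] = count  # value changed: "address 3" write
        current = pixel
        count = table[pixel] + 1    # read stored count, add this pixel
    if current is not None:
        table[current] = count      # end of input: "address 4" write
    return table


histogram = collect_histogram([3, 3, 5, 3, 255])
print(histogram[3], histogram[5], histogram[255])
```

Long runs of identical pixels cost one table write per run, which is why the text gives the 2 cycle figure only for the case where every pixel differs from its neighbor.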

Stage 1 takes 256 cycles, or 2560ns. Stage 2 varies
according to the values of the pixels. The worst case time
for lookup table replacement is 2 cycles per image pixel if
every pixel is not the same as its neighbor. The time taken
for a single colour lookup is 0.03s (1500 x 1000 x 2 cycles
per pixel x 10ns per cycle = 30,000,000ns). The time taken
for 3 colour components is 3 times this amount, or 0.09s.
There is no speed gain by combining the 
Color Transform 

Color transformation is achieved in two main ways: 

Lookup table replacement 

Color space conversion 
Lookup Table Replacement 



As illustrated in Fig. 51, one of the simplest ways to
transform the colour of a pixel is to encode an arbitrarily
complex transform function into a lookup table 380. The
component colour value of the pixel is used to lookup 381
the new component value of the pixel. For each pixel read
from a Sequential Read Iterator, its new value is read from
the New Colour Table 380, and written to a Sequential Write
Iterator 383. The input image can be processed
simultaneously in two halves to make effective use of memory
bandwidth. The following lookup table is used:



Lookup   Size               Details
------   ----------------   ------------------------------------------
LU1      256 entries        Replacement[X]
         8 bits per entry   Table indexed by the 8 highest significant
                            bits of X.
                            Resultant 8 bits treated as fixed point 0:8



The total process requires 2 Sequential Read Iterators
and 2 Sequential Write Iterators. The 2 New Colour Tables
require 8 cache lines each to hold the 256 bytes (256
entries of 1 byte).

The average time for lookup table replacement is
therefore ½ cycle per image pixel. The time taken for a
single colour lookup is 0.0075s (1500 x 1000 x ½ cycle per
pixel x 10ns per cycle = 7,500,000ns). The time taken for 3
colour components is 3 times this amount, or 0.0225s. Each
colour component has to be processed one after the other
under control of software.
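
Lookup table replacement amounts to indexing a 256-entry table with each 8-bit component value. The sketch below shows the idea for one colour channel; the function name and the example inversion table are illustrative assumptions, and the splitting of the image into two halves for memory bandwidth is not modelled.

```python
def apply_lookup_table(channel, table):
    """Replace every 8-bit value in `channel` with table[value]."""
    if len(table) != 256:
        raise ValueError("lookup table must have exactly 256 entries")
    return bytes(table[value] for value in channel)


# Example table: invert the channel (an arbitrary transform for illustration).
invert = bytes(255 - value for value in range(256))
print(list(apply_lookup_table(bytes([0, 10, 255]), invert)))
```

Any transform of a single component, however complex, can be precomputed into such a table, so the per-pixel cost is independent of the transform.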
Colour Space Conversion 

Colour Space conversion is only required when moving
between colour spaces. The CCD images are captured in RGB
colour space, and printing occurs in CMY colour space, while
clients of the ACP 31 likely process images in the Lab
colour space. All of the input colour space channels are
typically required as input to determine each output
channel's component value. Thus the logical process is as
illustrated 385 in Fig. 52.

Simply, conversion between Lab, RGB, and CMY is fairly 
straightforward. However the individual colour profile of a 
particular device can vary considerably. Consequently, to 
allow future CCDs, inks, and printers, the ACP 31 performs 
colour space conversion by means of tri-linear interpolation 
from colour space conversion lookup tables. 

Colour coherence tends to be area based rather than
line based. To aid cache coherence during tri-linear
interpolation lookups, it is best to process an image in
vertical strips. Thus the read 386-388 and write 389
iterators would be Vertical-Strip Iterators.
Tri-linear colour space conversion 

For each output colour component, a single 3D table 
mapping the input colour space to the output colour 
component is required. For example, to convert CCD images 
from RGB to Lab, 3 tables calibrated to the physical 
characteristics of the CCD are required: 

RGB->L 

RGB->a 

RGB->b 

To convert from Lab to CMY, 3 tables calibrated to the
physical characteristics of the ink/printer are required:

Lab->C

Lab->M

Lab->Y

The 8-bit input colour components are treated as fixed- 
point numbers (3:5) in order to index into the conversion 
tables. The 3 bits of integer give the index, and the 5 bits 
of fraction are used for interpolation. Since 3 bits gives 8 
values, 3 dimensions gives 512 entries (8x8x8). The size 
of each entry is 1 byte, requiring 512 bytes per table. 
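
The 3:5 fixed-point indexing can be illustrated with a short sketch. The flat table layout, the function name, and in particular the clamping of the upper neighbour index to 7 are assumptions made for this example; the text does not say how the top edge of the table is handled.

```python
def trilinear_convert(table, x, y, z):
    """Look up one output component from an 8x8x8 conversion table.

    `table` holds 512 entries laid out as table[(ix * 8 + iy) * 8 + iz].
    x, y and z are 8-bit inputs treated as 3:5 fixed point: the top 3 bits
    select the table cell, the low 5 bits are the interpolation fraction.
    """
    def split(value):
        index = value >> 5              # 3 integer bits
        fraction = value & 0x1F         # 5 fraction bits
        upper = min(index + 1, 7)       # assumption: clamp at table edge
        return index, upper, fraction

    def entry(ix, iy, iz):
        return table[(ix * 8 + iy) * 8 + iz]

    def lerp(a, b, fraction):
        return a + (b - a) * fraction / 32.0

    x0, x1, fx = split(x)
    y0, y1, fy = split(y)
    z0, z1, fz = split(z)

    # Interpolate the 8 surrounding entries along z, then y, then x.
    c00 = lerp(entry(x0, y0, z0), entry(x0, y0, z1), fz)
    c01 = lerp(entry(x0, y1, z0), entry(x0, y1, z1), fz)
    c10 = lerp(entry(x1, y0, z0), entry(x1, y0, z1), fz)
    c11 = lerp(entry(x1, y1, z0), entry(x1, y1, z1), fz)
    c0 = lerp(c00, c01, fy)
    c1 = lerp(c10, c11, fy)
    return int(round(lerp(c0, c1, fx)))


# Illustrative table whose value depends only on the x index.
ramp = [ix * 32 for ix in range(8) for _ in range(8) for _ in range(8)]
print(trilinear_convert(ramp, 48, 10, 200))
```

Three such tables, one per output component, would be needed for a full RGB to Lab or Lab to CMY conversion.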
The Convert Color Space process can therefore be
implemented as shown in Fig. 53 and the following lookup
table is used:

Lookup   Size               Details
------   ----------------   ------------------------------------------
LU1      8 x 8 x 8          Convert[X, Y, Z]
         entries            Table indexed by the 3 highest bits of X,
         512 entries        Y, and Z.
         8 bits per entry   8 entries returned from Tri-linear index
                            address unit
                            Resultant 8 bits treated as fixed point 8:0
                            Transfer time is 8 entries at 1 byte per
                            entry



Tri-linear interpolation returns interpolation between
8 values. Each 8 bit value takes 1 cycle to be returned from
the lookup, for a total of 8 cycles. The tri-linear
interpolation also takes 8 cycles when 2 Multiply ALUs are
used per cycle. General tri-linear interpolation information
is given in the ALU section of this document. The 512 bytes
for the lookup table fits in 16 cache lines.

The time taken to convert a single colour component of
image is therefore 0.105s (1500 * 1000 * 7 cycles * 10ns per
cycle). To convert 3 components takes 0.315s. Fortunately
the colour space conversion for printout takes place on the
fly during printout itself so is not a perceived delay.

If colour components are converted separately, they
must not overwrite their input colour space components since
all colour components from the input colour space are
required for converting each component.

Since only 1 multiply unit is used to perform the
interpolation, it is alternatively possible to do the entire
Lab->CMY conversion as a single pass. This would require 3
Vertical-Strip Read Iterators, 3 Vertical-Strip Write
Iterators, and access to 3 conversion tables simultaneously.
In that case, it is possible to write back onto the input
image and thus use no extra memory. However, access to 3
conversion tables equals 1/3 of the caching for each, which
could lead to high latency for the overall process.
Affine Transform 

Prior to compositing an image with a photo, it may be 
necessary to rotate, scale and translate it. If the image is 
only being translated, it can be faster to use a direct sub- 
pixel translation function. However, rotation, scale-up and 
translation can all be incorporated into a single affine 
transform. 

A general affine transform can be included as an
accelerated function. Affine transforms are limited to 2D,
and if scaling down, input images should be pre-scaled via
the Scale function. Having a general affine transform
function allows an output image to be constructed one block
at a time, and can reduce the time taken to perform a number
of transformations on an image since all can be applied at
the same time.

A transformation matrix needs to be supplied by the
client. The matrix should be the inverse matrix of the
transformation desired, i.e. applying the matrix to the
output pixel coordinate will give the input coordinate.

A 2D matrix is usually represented as a 3 x 3 array:

    | a  b  0 |
    | c  d  0 |
    | e  f  1 |

Since the 3rd column is always [0, 0, 1] clients do not
need to specify it. Clients instead specify a, b, c, d, e,
and f.

Given a coordinate in the output image (x, y) whose top
left pixel coordinate is given as (0, 0), the input
coordinate is specified by: (ax + cy + e, bx + dy + f). Once
the input coordinate is determined, the input image is
sampled to arrive at the pixel value. Bi-linear
interpolation of input image pixels is used to determine the
value of the pixel at the calculated coordinate.

Since affine transforms preserve parallel lines, images are
processed in output vertical strips 32 pixels wide for best
average input image cache coherence.
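
The inverse mapping and bi-linear sampling can be sketched as follows. The function name, the list-of-rows image representation and the `outside` fill value are assumptions for illustration; the sketch walks the output in plain raster order and does not model the 32-pixel vertical strips used for cache coherence.

```python
import math


def affine_transform(src, out_w, out_h, a, b, c, d, e, f, outside=0):
    """Build an output image by inverse-mapping each output pixel.

    For output (x, y) the input coordinate is (a*x + c*y + e, b*x + d*y + f),
    sampled with bi-linear interpolation. `src` is a list of rows.
    """
    in_h, in_w = len(src), len(src[0])

    def pixel(ix, iy):
        # Pixels outside the input image take the fill value.
        if 0 <= ix < in_w and 0 <= iy < in_h:
            return src[iy][ix]
        return outside

    out = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            u = a * x + c * y + e
            v = b * x + d * y + f
            iu, iv = math.floor(u), math.floor(v)
            fu, fv = u - iu, v - iv
            # Linear interpolation in X on lines Y and Y+1 ...
            top = pixel(iu, iv) + (pixel(iu + 1, iv) - pixel(iu, iv)) * fu
            bottom = (pixel(iu, iv + 1)
                      + (pixel(iu + 1, iv + 1) - pixel(iu, iv + 1)) * fu)
            # ... then linear interpolation in Y between the two results.
            row.append(top + (bottom - top) * fv)
        out.append(row)
    return out


image = [[10, 20], [30, 40]]
print(affine_transform(image, 2, 2, 1, 0, 0, 1, 0.5, 0))
```

Because the supplied matrix is the inverse of the desired transform, every output pixel is visited exactly once and no holes appear in the result.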

3 Multiply ALUs are required to perform the bi-linear 
interpolation in 2 cycles. Multiply ALUs 1 and 2 do linear 
interpolation in X for lines Y and Y+1 respectively, and 
Multiply ALU 3 does linear interpolation in Y between the 
values output by Multiply ALUs 1 and 2. 

As we move to the right across an output line in X, 2
Adder ALUs calculate the actual input image coordinates by
adding 'a' to the current X value, and 'b' to the current Y
value respectively. When we advance to the next line (either
the next line in a vertical strip after processing a maximum
of 32 pixels, or to the first line in a new vertical strip)
we update X and Y to pre-calculated start coordinate values
constants for the given block.

The process for calculating an input coordinate is
given in Fig. 54 where the following constants are set by
software:

Constant   Value
--------   -----
K1         c
K2         a
K3         e
K4         b
K5         d
K6         f



Calculate Pixel

Once we have the input image coordinates, the input
image must be sampled. A lookup table is used to return the
values at the specified coordinates in readiness for
bilinear interpolation. The basic process is as indicated
in Fig. 55 and the following lookup table is used:



Lookup   Size               Details
------   ----------------   -------------------------------------------
LU1      Image width by     Bilinear Image lookup [X, Y]
         Image height       Table indexed by the integer part of X
         8 bits per entry   and Y.
                            4 entries returned from Bilinear index
                            address unit, 2 per cycle.
                            Each 8 bit entry treated as fixed point 8:0
                            Transfer time is 2 cycles (2 16 bit entries
                            in FIFO hold the 4 8 bit entries)



The affine transform requires all 4 Multiply Units and
all 4 Adder ALUs, and with good cache coherence can perform
an affine transform with an average of 2 cycles per output
pixel. This timing assumes good cache coherence, which is
true for non-skewed images. Worst case timings are severely
skewed images, which meaningful Vark scripts are unlikely to
contain.

The time taken to transform a 128 x 128 image is
therefore 0.00033 seconds (32,768 cycles). If this is a clip
image with 4 channels (including an α channel), the total time
taken is 0.00131 seconds (131,072 cycles).

A Vertical-Strip Write Iterator is required to output
the pixels. No Read Iterator is required. However, since the
affine transform accelerator is bound by time taken to
access input image pixels, as many cache lines as possible
should be allocated to the read of pixels from the input
image. At least 32 should be available, and preferably 64 or
more.
Scaling 

Scaling is essentially a re-sampling of an image. Scale
up of an image can be performed using the Affine Transform
function. Generalized scaling of an image, including scale
down, is performed by the hardware accelerated Scale
function. Scaling is performed independently in X and Y, so
different scale factors can be used in each dimension.

The generalized scale unit must match the Affine
Transform scale function in terms of registration. The
generalized scaling process is as illustrated in Fig. 56.

The scale in X is accomplished by Fant's resampling
algorithm as illustrated in Fig. 57, where the following
constants are set by software:



Constant   Value
--------   ------------------------------------------------
K1         Number of input pixels that contribute to an
           output pixel in X
K2         1/K1


The following registers are used to hold temporary
variables:

Variable   Value
--------   ------------------------------------------------
Latch1     Amount of input pixel remaining unused (starts
           at 1 and decrements)
Latch2     Amount of input pixels remaining to contribute
           to current output pixel (starts at K1 and
           decrements)
Latch3     Next pixel (in X)
Latch4     Current pixel
Latch5     Accumulator for output pixel (unscaled)
Latch6     Pixel Scaled in X (output)
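
The roles of K1, K2 and the latches can be illustrated with a simplified area-weighted resampler for one row. This is a sketch under stated assumptions, not the circuit of Fig. 57: it weights each input pixel by the fraction that falls inside the output pixel and omits the interpolation toward the next pixel (Latch3) that Fant's algorithm can apply. The function name is hypothetical, and a trailing partial output pixel is discarded.

```python
def scale_in_x(row, k1):
    """Resample one row, with k1 input pixels contributing to each output."""
    k2 = 1.0 / k1            # K2: scale applied to the accumulated sum
    eps = 1e-9               # tolerance for floating-point remainders
    out = []
    source = iter(row)
    current = next(source, None)  # Latch4: current pixel
    latch1 = 1.0             # Latch1: unused amount of the current input pixel
    latch2 = k1              # Latch2: input still owed to the output pixel
    acc = 0.0                # Latch5: unscaled accumulator

    while current is not None:
        if latch1 <= latch2:
            # The rest of the input pixel fits inside the output pixel.
            acc += current * latch1
            latch2 -= latch1
            latch1 = 1.0
            current = next(source, None)
            if latch2 <= eps:
                out.append(acc * k2)   # Latch6: scaled output pixel
                acc, latch2 = 0.0, k1
        else:
            # The output pixel is completed by part of the input pixel.
            acc += current * latch2
            latch1 -= latch2
            out.append(acc * k2)
            acc, latch2 = 0.0, k1
    return out


print(scale_in_x([10, 20, 30, 40], 2))
```

With K1 = 2 each output pixel is the mean of two input pixels; fractional values of K1 split an input pixel between two neighbouring outputs.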



The Scale in Y process is illustrated in Fig. 58 and is also
accomplished by a slightly altered version of Fant's
resampling algorithm to account for processing in order of X
pixels, where the following constants are set by software:

Constant   Value
--------   ------------------------------------------------
K1         Number of input pixels that contribute to an
           output pixel in Y
K2         1/K1


The following registers are used to hold temporary
variables:

Variable   Value
--------   ------------------------------------------------
Latch1     Amount of input pixel remaining unused (starts at
           1 and decrements)
Latch2     Amount of input pixels remaining to contribute to
           current output pixel (starts at K1 and
           decrements)
Latch3     Next pixel (in Y)
Latch4     Current pixel
Latch5     Pixel Scaled in Y (output)


The following DRAM FIFOs are used: 



Lookup   Size                     Details
------   ----------------------   --------------------------------
FIFO1    ImageWidthOUT entries    1 row of image pixels already
         8 bits per entry         scaled in X
                                  1 cycle transfer time
FIFO2    ImageWidthOUT entries    1 row of image pixels already
         16 bits per entry        scaled in X
                                  2 cycles transfer time (1 byte
                                  per cycle)


Tessellate Image

Tessellation of an image is a form of tiling. It
involves copying a specially designed "tile" multiple times
horizontally and vertically into a second (usually larger)
image space. When tessellated, the small tile forms a
seamless picture. One example of this is a small tile of a
section of a brick wall. It is designed so that when
tessellated, it forms a full brick wall. Note that there is
no scaling or sub-pixel translation involved in
tessellation.

The most cache-coherent way to perform tessellation is
to output the image sequentially line by line, and to repeat
the same line of the input image for the duration of the
line. When we finish the line, the input image must also
advance to the next line (and repeat it multiple times
across the output line).
An overview of the tessellation function is illustrated
in Fig. 59:

The Sequential Read Iterator 392 is set up to
continuously read a single line of the input tile (StartLine
would be 0 and EndLine would be 1). Each input pixel is
written to all 3 of the Write Iterators 393-395. A counter
397 in an Adder ALU counts down the number of pixels in an
output line, terminating the sequence at the end of the
line.

At the end of processing a line, a small software
routine updates the Sequential Read Iterator's StartLine &
EndLine registers before restarting the microcode and the
Sequential Read Iterator (which clears the FIFO and repeats
line 2 of the tile). The Write Iterators 393-395 are not
updated, and simply keep on writing out to their respective
parts of the output image.

The net effect is that the tile has one line repeated 
across an output line, and then the tile is repeated 
vertically too. 
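
The net effect described above can be sketched in a few lines. The function name and the list-of-rows representation are assumptions for illustration; the three Write Iterators and the software update of StartLine and EndLine are reduced here to simple modulo indexing.

```python
def tessellate(tile, out_w, out_h):
    """Fill an out_w x out_h image by repeating `tile` in X and Y."""
    tile_h = len(tile)
    out = []
    for y in range(out_h):
        # One tile line is selected per output line (the software step) ...
        line = tile[y % tile_h]
        tile_w = len(line)
        # ... and repeated across the whole output line.
        out.append([line[x % tile_w] for x in range(out_w)])
    return out


for row in tessellate([[1, 2], [3, 4]], 5, 3):
    print(row)
```

Because each output line reads only one line of the tile, the input side stays cache coherent regardless of the tile size.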

This process does not fully use the memory bandwidth
since we get good cache coherence in the input image, but it
does allow the tessellation to function with tiles of any
size. The process uses 1 Adder ALU. If the 3 Write Iterators
393-395 each write to 1/3 of the image (breaking the image
on tile sized boundaries), then the entire tessellation
process takes place at an average speed of 1/3 cycle per
output image pixel. For an image of 1500 x 1000, this
equates to .005 seconds (5,000,000ns).
Sub-pixel Translator 

Before compositing an image with a background, it may
be necessary to translate it by a sub-pixel amount in both X
and Y. Sub-pixel transforms can increase an image's size by
1 pixel in each dimension. The value of the region outside
the image can be client determined, such as a constant value
(e.g. black), or edge pixel replication. Typically it will
be better to use black.
The sub-pixel translation process is as illustrated in
Fig. 60. Sub-pixel translation in a given dimension is
defined by:

    Pixel_out = Pixel_in * (1 - Translation) + Pixel_in-1 * Translation

It can also be represented as a form of interpolation:

    Pixel_out = Pixel_in + (Pixel_in-1 - Pixel_in) * Translation

Implementation of a single (on average) cycle
interpolation engine using a single Multiply ALU and a
single Adder ALU in conjunction is straightforward. Sub-pixel
translation in both X & Y requires 2 interpolation
engines.
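
A minimal sketch of the two interpolation engines follows, using the first form of the definition above. The function names and the `outside` fill value (black by default) are assumptions for illustration. Each pass lengthens the data by one pixel, so a 128 x 128 tile becomes 129 x 129.

```python
def subpixel_translate_1d(line, translation, outside=0):
    """Translate one line by a fraction of a pixel (0 <= translation < 1).

    Each output is Pixel_in * (1 - Translation) + Pixel_in-1 * Translation,
    with `outside` standing in for pixels beyond either end of the line.
    """
    padded = [outside] + list(line) + [outside]
    return [
        padded[i + 1] * (1 - translation) + padded[i] * translation
        for i in range(len(line) + 1)
    ]


def subpixel_translate(image, tx, ty, outside=0):
    """Translate a list-of-rows image in Y first, then in X."""
    width = len(image[0])
    # First engine: interpolate between vertically adjacent lines.
    columns = [
        subpixel_translate_1d([row[x] for row in image], ty, outside)
        for x in range(width)
    ]
    rows = [[column[y] for column in columns] for y in range(len(image) + 1)]
    # Second engine: interpolate along each line as a single stream.
    return [subpixel_translate_1d(row, tx, outside) for row in rows]


print(subpixel_translate_1d([100, 200], 0.25))
```

A translation of 0 reproduces the input followed by one `outside` pixel, which is a quick sanity check on the indexing.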

In order to sub-pixel translate in Y, 2 Sequential Read
Iterators 400, 401 are required (one is reading a line ahead
of the other from the same image), and a single Sequential
Write Iterator 403 is required.

The first interpolation engine (interpolation in Y)
accepts pairs of data from 2 streams, and linearly
interpolates between them. The second interpolation engine
(interpolation in X) accepts its data as a single
1-dimensional stream and linearly interpolates between values.
Both engines interpolate in 1 cycle on average. Descriptions
of interpolators and example microcode for the engines can
be found in the ALU section of this document.

Each interpolation engine 405, 406 is capable of
performing the sub-pixel translation in 1 cycle per output
pixel on average. The overall time is therefore 1 cycle per
output pixel, with requirements of 2 Multiply ALUs and 2
Adder ALUs.

The time taken to output 32 pixels from the sub-pixel
translate function is on average 320ns (32 cycles). This is
enough time for 4 full cache-line accesses to DRAM, so the
use of 3 Sequential Iterators is well within timing limits.

The total time taken to sub-pixel translate an image is
therefore 1 cycle per pixel of the output image. A typical
image to be sub-pixel translated is a tile of size 128 *
128. The output image size is 129 * 129. The process takes
129 * 129 * 10ns = 166,410ns.

The Image Tiler function also makes use of the sub-pixel
translation algorithm, but does not require the
writing out of the sub-pixel-translated data, but rather
processes it further.
Image Tiler 

The high level algorithm for tiling an image is carried
out in software. Once the placement of the tile has been
determined, the appropriate coloured tile must be
composited. The actual compositing of each tile onto an
image is carried out in hardware via the microcoded ALUs.
Compositing a tile involves both a texture application and a
colour application to a background image. In some cases it
is desirable to compare the actual amount of texture added
to the background in relation to the intended amount of
texture, and use this to scale the colour being applied. In
these cases the texture must be applied first.

Since colour application functionality and texture
application functionality are somewhat independent, they are
separated into sub-functions.

The number of cycles per 4-channel tile composite for
the different texture styles and colouring styles is
summarized in the following table:

                                         Constant   Pixel
                                         colour     colour
--------------------------------------   --------   ------
Replace texture                          4          4.75
25% background + tile texture            4          4.75
Average height algorithm                 5          5.75
Average height algorithm with feedback   5.75       6.5
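
The totals in this table can be reproduced from the per-channel figures derived in the prose of the Tile Colouring and Tile Texturing subsections, on the assumption (not stated explicitly in the text) that a 4-channel composite consists of 3 colour channels plus 1 texture channel. The dictionary names and the function below are illustrative only.

```python
# Cycles per pixel for one colour channel, as derived in the prose:
# (without feedback, with feedback from texture).
COLOUR_CYCLES = {
    "constant": (1.0, 1.25),
    "pixel": (1.25, 1.5),
}

# Cycles per pixel for the texture channel, per texture style.
TEXTURE_CYCLES = {
    "replace": 1,
    "25% background": 1,
    "average height": 2,
}


def composite_cycles(texture_style, colour_style, feedback=False,
                     colour_channels=3):
    """Cycles per pixel for a tile composite of colour plus texture."""
    colour = COLOUR_CYCLES[colour_style][1 if feedback else 0]
    return colour_channels * colour + TEXTURE_CYCLES[texture_style]


print(composite_cycles("replace", "pixel"))
print(composite_cycles("average height", "pixel", feedback=True))
```

Under that assumption each table entry is three times the colour cost plus the texture cost, for example 3 x 1.25 + 1 = 4.75 for replace texture with pixel colour.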



Tile Colouring and Compositing 

A tile is set to have either a constant colour (for the
whole tile), or takes each pixel value from an input image.
Both of these cases may also have feedback from a texturing
stage to scale the opacity (similar to thinning paint).

The steps for the 4 cases can be summarized as:

Sub-pixel translate the tile's opacity values.

Optionally scale the tile's opacity (if feedback from
texture application is enabled).

Determine the colour of the pixel (constant or from an
image map).

Composite the pixel onto the background image.

Each of the 4 cases is treated separately, in order to
minimize the time taken to perform the function. The summary
of time per colour compositing style for a single colour
channel is described in the following table:



Tiling color style                  No feedback    Feedback
                                    from texture   from texture
                                    (cycles per    (cycles per
                                    pixel)         pixel)
---------------------------------   ------------   ------------
Tile has constant color per pixel   1              2
Tile has per pixel color from       1.25           2
input image







Constant colour 

In this case, the tile has a constant colour, 
determined by software. While the ACP 31 is placing down one 
tile, the software can be determining the placement and 
colouring of the next tile. 

The colour of the tile can be determined by bi-linear 
interpolation into a scaled version of the image being 
tiled. The scaled version of the image can be created and 
stored in place of the image pyramid, and needs only to be 
performed once per entire tile operation. If the tile size
is 128 x 128, then the image can be scaled down by 128:1 in
each dimension.
Without feedback

When there is no feedback from the texturing of a tile,
the tile is simply placed at the specified coordinates. The
tile colour is used for each pixel's colour, and the opacity
for the composite comes from the tile's sub-pixel translated
opacity channel. In this case colour channels and the
texture channel can be processed completely independently
between tiling passes.

The overview of the process is illustrated in Fig. 61.

Sub-pixel translation 410 of a tile can be accomplished
using 2 Multiply ALUs and 2 Adder ALUs in an average time of
1 cycle per output pixel. The output from the sub-pixel
translation is the mask to be used in compositing 411 the
constant tile colour 412 with the background image from
background sequential Read Iterator 41.

Compositing can be performed using 1 Multiply ALU and 1
Adder ALU in an average time of 1 cycle per composite.

Requirements are therefore 3 Multiply ALUs and 3 Adder
ALUs. 4 Sequential Iterators 413-416 are required, taking
320ns to read or write their contents. With an average
number of cycles of 1 per pixel to sub-pixel translate and
composite, there is sufficient time to read and write the
buffers.

With feedback

When there is feedback from the texturing of a tile,
the tile is placed at the specified coordinates. The tile
colour is used for each pixel's colour, and the opacity for
the composite comes from the tile's sub-pixel translated
opacity channel scaled by the feedback parameter. Thus the
texture values must be calculated before the colour value is
applied.

The overview of the process is illustrated in Fig. 62.
Sub-pixel translation of a tile can be accomplished using 2
Multiply ALUs and 2 Adder ALUs in an average time of 1 cycle
per output pixel. The output from the sub-pixel translation
is the mask to be scaled according to the feedback read from
the Feedback Sequential Read Iterator 420. The feedback is
passed to a Scaler (1 Multiply ALU) 421.

Compositing 422 can be performed using 1 Multiply ALU
and 1 Adder ALU in an average time of 1 cycle per composite.

Requirements are therefore all 4 Multiply ALUs and all
4 Adder ALUs. Although the entire process can be
accomplished in 1 cycle on average, the bottleneck is the
memory access, since 5 Sequential Iterators are required.
With sufficient buffering, the average time is 1.25 cycles
per pixel.

Colour from Input Image 

One way of colouring pixels in a tile is to take the
colour from pixels in an input image. Again, there are two
possibilities for compositing: with and without feedback
from the texturing.
Without feedback

In this case, the tile colour simply comes from the
relative pixel in the input image. The opacity for
compositing comes from the tile's opacity channel sub-pixel
shifted.

The overview of the process is illustrated in Fig. 63.
Sub-pixel translation 425 of a tile can be accomplished
using 2 Multiply ALUs and 2 Adder ALUs in an average time of
1 cycle per output pixel. The output from the sub-pixel
translation is the mask to be used in compositing 426 the
tile's pixel colour (read from the input image 428) with
the background image 429.

Compositing 426 can be performed using 1 Multiply ALU
and 1 Adder ALU in an average time of 1 cycle per composite.
Requirements are therefore 3 Multiply ALUs and 3 Adder
ALUs. Although the entire process can be accomplished in 1
cycle on average, the bottleneck is the memory access, since
5 Sequential Iterators are required. With sufficient
buffering, the average time is 1.25 cycles per pixel.

With feedback 
In this case, the tile colour still comes from the
relative pixel in the input image, but the opacity for
compositing is affected by the relative amount of texture
height actually applied during the texturing pass. This
process is as illustrated in Fig. 64.

Sub-pixel translation 431 of a tile can be accomplished
using 2 Multiply ALUs and 2 Adder ALUs in an average time of
1 cycle per output pixel. The output from the sub-pixel
translation is the mask to be scaled 431 according to the
feedback read from the Feedback Sequential Read Iterator
432. The feedback is passed to a Scaler (1 Multiply ALU)
431.

Compositing 434 can be performed using 1 Multiply ALU
and 1 Adder ALU in an average time of 1 cycle per composite.
Requirements are therefore all 4 Multiply ALUs and 3
Adder ALUs. Although the entire process can be accomplished
in 1 cycle on average, the bottleneck is the memory access,
since 6 Sequential Iterators are required. With sufficient
buffering, the average time is 1.5 cycles per pixel.

Tile Texturing

Each tile has a surface texture defined by its texture
channel. The texture must be sub-pixel translated and then
applied to the output image. There are 3 styles of texture
compositing:

Replace texture

25% background + tile's texture

Average height algorithm

In addition, the Average height algorithm can save
feedback parameters for colour compositing.

The time taken per texture compositing style is
summarized in the following table:



Tiling colour style             Cycles per pixel   Cycles per pixel
                                (no feedback       (feedback from
                                from texture)      texture)
-----------------------------   ----------------   ----------------
Replace texture                 1
25% background + tile           1
texture value
Average height algorithm        2                  2



Replace texture 

In this instance, the texture from the tile replaces
the texture channel of the image, as illustrated in Fig. 65.
Sub-pixel translation 436 of a tile's texture can be
accomplished using 2 Multiply ALUs and 2 Adder ALUs in an
average time of 1 cycle per output pixel. The output from
this sub-pixel translation is fed directly to the Sequential
Write Iterator 437.

The time taken for replace texture compositing is 1
cycle per pixel. Note that there is no feedback, since 100%
of the texture value is always applied to the background.
There is therefore no requirement for processing the
channels in any particular order.

25% Background + Tile's Texture

In this instance, the texture from the tile is added to
25% of the existing texture value. The new value must be
greater than or equal to the original value. In addition,
the new texture value must be clipped at 255 since the
texture channel is only 8 bits. The process utilised is
illustrated in Fig. 66.

Sub-pixel translation 440 of a tile's texture can be
accomplished using 2 Multiply ALUs and 2 Adder ALUs in an
average time of 1 cycle per output pixel. The output from
this sub-pixel translation 440 is fed to an adder 441 where
it is added to ¼ 442 of the background texture value. Min
and Max functions 444 are provided by the 2 adders not used
for sub-pixel translation and the output written to a
Sequential Write Iterator 445.
The time taken for this style of texture compositing is
1 cycle per pixel. There is no feedback, since 100% of the
texture value is considered to have been applied to the
background (even if clipping at 255 occurred). There is
therefore no requirement for processing the channels in any
particular order.
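
The per-pixel rule for this texture style can be sketched as below. The function name is hypothetical, and the use of integer division for the 25% term is an assumption; the text does not specify how the quarter is rounded.

```python
def add_texture_25(background, tile_texture):
    """Combine a tile texture value with 25% of the background texture."""
    value = tile_texture + background // 4  # tile texture + 1/4 background
    value = max(value, background)          # never drop below the original
    return min(value, 255)                  # clip to the 8-bit channel


print(add_texture_25(100, 50), add_texture_25(100, 200), add_texture_25(200, 250))
```

The Max step corresponds to the requirement that the new value be at least the original value, and the Min step to the clip at 255.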
Average height algorithm 

In this texture application algorithm, the average
height under the tile is computed, and each pixel's height
is compared to the average height. If the pixel's height is
less than the average, the stroke height is added to the
background height. If the pixel's height is greater than or
equal to the average, then the stroke height is added to the
average height. Thus background peaks thin the stroke. The
height is constrained to increase by a minimum amount to
prevent the background from thinning the stroke application
to 0 (the minimum amount can be 0 however). The height is
also clipped at 255 due to the 8-bit resolution of the
texture channel.
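
The rule above, without the optional feedback, can be sketched per pixel as follows. The function and parameter names are illustrative assumptions; `average` and `minimum` stand for the values the text says are supplied by software.

```python
def average_height_texture(background, stroke, average, minimum=0):
    """Apply a stroke height to one background texture pixel.

    background: existing texture height of the pixel (0-255)
    stroke:     texture height contributed by the tile at this pixel
    average:    average background height under the tile
    minimum:    smallest increase the stroke is allowed to produce
    """
    if background < average:
        new_height = background + stroke  # below average: build on the pixel
    else:
        new_height = average + stroke     # peak: build on the average instead
    # Constrain the height to increase by at least the minimum amount.
    new_height = max(new_height, background + minimum)
    return min(new_height, 255)           # clip to the 8-bit texture channel


print(average_height_texture(50, 30, 80), average_height_texture(120, 30, 80, 5))
```

A background peak well above the average therefore receives only the minimum increase, which is the "thinning" of the stroke described in the text.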

There can be feedback of the difference in texture 
applied versus the expected amount applied. The feedback 
amount can be used as a scale factor in the application of 
the tile's colour. 

In both cases, the average texture is provided by
software, calculated by performing a bi-level interpolation
on a scaled version of the texture map. Software would
determine the next tile's average texture height while the
current tile is being applied. Software must also provide
the minimum thickness for addition, which is typically
constant for the entire tiling process.
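
As a minimal sketch of the per-pixel rule of the average height algorithm (not the microcode of the figures), the following Python function applies a stroke height to one 8-bit background texture value. The function and parameter names are illustrative assumptions; the average height and the minimum increase are passed in because the text states that software supplies them.

```python
def apply_average_height(background, stroke, average, min_increase=0):
    """Return the new 8-bit texture height for one pixel.

    All names here are hypothetical; only the rule itself comes from the text.
    """
    # Below-average pixels receive the stroke on top of their own height;
    # pixels at or above the average receive it on top of the average.
    base = background if background < average else average
    raised = base + stroke
    # The height must rise by at least min_increase, so a background peak
    # cannot thin the stroke application to nothing.
    floor = background + min_increase
    # Clip to the 8-bit range of the texture channel.
    return min(255, max(raised, floor))
```

With an average of 50, a stroke of 20 and a minimum increase of 2, a valley pixel at height 10 rises to 30, while a peak pixel at height 100 rises only to 102.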
Without feedback 

With no feedback, the texture is simply applied to the 
background texture, as shown in Fig. 67. 

4 Sequential Iterators are required, which means that 
if the process can be pipelined for 1 cycle, the memory is 
fast enough to keep up. 



Sub-pixel translation 450 of a tile's texture can be
accomplished using 2 Multiply ALUs and 2 Adder ALUs in an 
average time of 1 cycle per output pixel. Each Min & Max 
function 451,452 requires a separate Adder ALU in order to 
complete the entire operation in 1 cycle. Since 2 are 
already used by the sub-pixel translation of the texture, 
there are not enough remaining for a 1 cycle average time. 

The average time for processing 1 pixel's texture is
therefore 2 cycles. Note that there is no feedback, and
hence the colour channel order of compositing is irrelevant.

With feedback

This is conceptually the same as the case without
feedback, except that in addition to the standard processing
of the texture application algorithm, it is necessary to
also record the proportion of the texture actually applied.
The proportion can be used as a scale factor for subsequent
compositing of the tile's colour onto the background image.
A flow diagram is illustrated in Fig. 68.
The following lookup table is used:

Lookup   Size                  Details
LU1      256 entries,          1/N
         16 bits per entry     Table indexed by N (range 0-255)
                               Resultant 16 bits treated as fixed
                               point 0:16

Each of the 256 entries in the 1/N table 460 is 16 bits, thus
requiring 16 cache lines to hold continuously.

Sub-pixel translation 461 of a tile's texture can be 
accomplished using 2 Multiply ALUs and 2 Adder ALUs in an 
average time of 1 cycle per output pixel. Each Min 462 & Max
463 function requires a separate Adder ALU in order to
complete the entire operation in 1 cycle. Since 2 are 
already used by the sub-pixel translation of the texture, 
there are not enough remaining for a 1 cycle average time. 

The average time for processing 1 pixel's texture is
therefore 2 cycles. Sufficient space must be allocated for
the feedback data area (a tile sized image channel). The
texture must be applied before the tile's colour is applied,
since the feedback is used in scaling the tile's opacity.
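
The feedback value can be pictured with a short Python sketch. It assumes the proportion is the texture actually applied divided by the amount expected, computed by multiplying by a 1/N table entry in 0:16 fixed point. The saturation of the entries for N = 0 and N = 1, the result format, and all names are assumptions made for illustration; the flow of Fig. 68 is not reproduced here.

```python
def build_reciprocal_table():
    """256-entry table of 1/N in 0:16 fixed point (16 fraction bits)."""
    table = []
    for n in range(256):
        if n <= 1:
            # Assumption: 1/0 and 1/1 do not fit in 0:16, so saturate.
            table.append(0xFFFF)
        else:
            table.append((1 << 16) // n)
    return table


RECIPROCAL = build_reciprocal_table()


def applied_proportion(applied, expected):
    """Return applied/expected as a 0:16 fixed-point value (hypothetical helper)."""
    return min(0xFFFF, applied * RECIPROCAL[expected])
```

For example, if clipping allowed only 10 of an expected 20 height units to be applied, the result is close to 0.5 in 0:16 fixed point, and that value could then scale the tile's opacity.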
CCD Image Interpolator 
Images obtained from the CCD via the ISI 83 (Fig. 3) are
750 x 500 pixels. When the image is captured via the ISI,
the orientation of the camera is used to rotate the pixels
by 0, 90, 180, or 270 degrees so that the top of the image
corresponds to 'up'. Since every pixel only has an R, G, or
B colour component (rather than all 3), the fact that these
have been rotated must be taken into account when
interpreting the pixel values. Depending on the orientation
of the camera, each 2x2 pixel block has one of the
configurations illustrated in Fig. 69.

Several processes need to be performed on the CCD
captured image in order to transform it into a useful form
for processing:

• Up-interpolation of low-sample rate colour components
in CCD image (interpreting correct orientation of pixels)
• Colour conversion from RGB to the internal colour space
• Scaling of the internal space image from 750 x 500 to
1500 x 1000
• Writing out the image in a planar format

The entire channel of an image is required to be
available at the same time in order to allow warping. In a
low memory model (8MB), there is only enough space to hold a
single channel at full resolution as a temporary object.
Thus the colour conversion is to a single colour channel.
The limiting factor on the process is the colour conversion,
as it involves tri-linear interpolation from RGB to the
internal colour space, a process that takes 0.026 seconds per
channel (750 x 500 x 7 cycles per pixel x 10ns per cycle =
26,250,000ns).
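
The timing arithmetic can be checked with a few lines of Python. The helper name is an assumption; the figures (750 x 500 pixels, 7 cycles per pixel, 10ns per cycle) are those quoted in the text.

```python
def pass_time_seconds(width, height, cycles_per_pixel, ns_per_cycle=10):
    """Elapsed time of one pass over a width x height channel, in seconds."""
    total_ns = width * height * cycles_per_pixel * ns_per_cycle
    return total_ns / 1e9


# Colour conversion of one 750 x 500 channel at 7 cycles per pixel.
conversion_time = pass_time_seconds(750, 500, 7)
```

The same helper gives 0.03 seconds for a 2 cycle per pixel pass over a 1500 x 1000 output image.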

It is important to perform the colour conversion before
scaling of the internal colour space image as this reduces
the number of pixels colour converted (and hence the overall
process time) by a factor of 4.

The requirements for all of the transformations do not
fit in the ALU scheme. The transformations are therefore
broken into two phases:

Phase 1:
• Up-interpolation of low-sample rate colour components in
CCD image (interpreting correct orientation of pixels)
• Colour conversion from RGB to the internal colour space
• Writing out the image in a planar format

Phase 2:
• Scaling of the internal space image from 750 x 500 to
1500 x 1000

Separating out the scale function implies that the
small colour converted image must be in memory at the same
time as the large one. The output from Phase 1 (0.5 MB) can
be safely written to the memory area usually kept for the
image pyramid (1 MB). The output from Phase 2 can be the
general expanded CCD image. Separation of the scaling also
allows the scaling to be accomplished by the Affine
Transform, and also allows for a different CCD resolution
that may not be a simple 1:2 expansion.

Phase 1: Up-interpolation of low-sample rate colour
components

Each of the 3 colour components (R, G, and B) needs to
be up-interpolated in order for colour conversion to take
place for a given pixel. We have 7 cycles to perform the
interpolation per pixel since the colour conversion takes 7
cycles.

Interpolation of G is straightforward and is
illustrated in Fig. 53. Depending on orientation, the actual
pixel value G alternates between odd pixels on odd lines &
even pixels on even lines, and odd pixels on even lines &
even pixels on odd lines. In both cases, linear
interpolation is all that is required. Interpolation of R
and B components, as illustrated in Fig. 71 and 72, is more
complicated, since in the horizontal and vertical directions
As can be seen from the diagrams, access to 3 rows of pixels
simultaneously is required, so 3 Sequential Read Iterators
are required, each one offset by a single row. In addition,
we have access to the previous pixel on the same row via a
latch for each row.
Each pixel therefore contains one component from the
CCD, and the other 2 up-interpolated. When one component is
being bi-linearly interpolated, the other is being linearly
interpolated. Since the interpolation factor is a constant
0.5, interpolation can be calculated by an add and a shift 1
bit right (in 1 cycle), and bi-linear interpolation of
factor 0.5 can be calculated by 3 adds and a shift 2 bits
right (3 cycles). The total number of cycles required is
therefore 4, using a single multiply ALU.
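
The add-and-shift arithmetic can be sketched as follows. The function names are illustrative assumptions, and the choice of which neighbouring pixels feed each interpolation (the arrangements of Figs. 71 to 74) is not reproduced.

```python
def linear_half(a, b):
    """Linear interpolation with factor 0.5: one add and a 1-bit right shift."""
    return (a + b) >> 1


def bilinear_half(a, b, c, d):
    """Bi-linear interpolation with factor 0.5: three adds and a 2-bit right shift."""
    return (a + b + c + d) >> 2
```

Because the shifts truncate, results round down: linear_half(10, 21) gives 15 rather than 15.5.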

Fig. 73 illustrates the case for rotation 0 even line
even pixel (EL, EP), and odd line odd pixel (OL, OP) and
Fig. 74 illustrates the case for rotation 0 even line odd
pixel (EL, OP), and odd line even pixel (OL, EP). The other
rotations are simply different forms of these two
expressions.

Color conversion

Color space conversion from RGB to Lab is achieved
using the same method as that described in the general Color
Space Convert function, a process that takes 8 cycles per
pixel. Phase 1 processing can be described with reference
to Fig. 75.

The up-interpolate of the RGB takes 4 cycles (1
Multiply ALU), but the conversion of the color space takes 8
cycles per pixel (2 Multiply ALUs) due to the lookup
transfer time.
Phase 2 

Scaling the image 

This phase is concerned with up-interpolating the image
from the CCD resolution (750 x 500) to the working photo
resolution (1500 x 1000). Scaling is accomplished by running
the Affine transform with a scale of 1:2. The timing of a
general affine transform is 2 cycles per output pixel, which
in this case means an elapsed scaling time of 0.03 seconds.
Timing Summary 
Illuminate Image 

Once an image has been processed, it can be illuminated by
one or more light sources. Light sources can be:

1. Directional - is infinitely distant so it casts parallel
light in a single direction

2. Omni - casts unfocused light in all directions.

3. Spot - casts a focused beam of light at a specific target
point. There is a cone and penumbra associated with a
spotlight.

The scene may also have an associated bump-map to cause
reflection angles to vary. Ambient light is also optionally
present in an illuminated scene.

In this description of accelerated illumination, we are
concerned with illuminating one image channel by a single
light source. Multiple light sources can be applied to a
single image channel as multiple passes: one pass per light
source. Multiple channels can be processed one at a time
with or without a bump-map.

The normal surface vector (N) at a pixel is computed
from the bump-map if present. The default normal vector, in
the absence of a bump-map, is perpendicular to the image
plane i.e. N = [0, 0, 1].

The viewing vector V is always perpendicular to the
image plane i.e. V = [0, 0, 1].

For a directional light source, the light source vector
(L) from a pixel to the light source is constant across the
entire image, so is computed once for the entire image. For
an omni light source (at a finite distance), the light
source vector is computed independently for each pixel.

A pixel's reflection of ambient light is computed
according to: Ia ka Od

A pixel's diffuse and specular reflection of a light
source is computed according to the Phong model:

fatt Ip [kd Od (N·L) + ks Os (R·V)^n]

When the light source is at infinity, the light source
intensity is constant across the image.

Each light source has three contributions per pixel:

• Ambient contribution
• Diffuse contribution
• Specular contribution
The light source can be defined using the following
variables:

dL     Distance from light source
fatt   Attenuation with distance [fatt = 1 / dL^2]
R      Normalized reflection vector [R = 2N(N·L) - L]
Ia     Ambient light intensity
Ip     Diffuse light coefficient
ka     Ambient reflection coefficient
kd     Diffuse reflection coefficient
ks     Specular reflection coefficient
ksc    Specular color coefficient
L      Normalized light source vector
N      Normalized surface normal vector
n      Specular exponent
Od     Object's diffuse color (i.e. image pixel color)
Os     Object's specular color (ksc Od + (1 - ksc) Ip)
V      Normalized viewing vector [V = [0, 0, 1]]

The same reflection coefficients (ka, ks, kd) are used for
each colour component.



A given pixel's value will be equal to the ambient
contribution plus the sum of each light's diffuse and
specular contribution.
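
The composition of a pixel value from its ambient, diffuse and specular parts can be sketched in floating-point Python as below. This is an illustration of the formulas in the text, not of the fixed-point ALU pipeline. The function names are assumptions, as is the clamping of negative dot products to 0, which the text does not state.

```python
def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def reflect(n, l):
    """Reflection vector R = 2N(N.L) - L for normalized N and L."""
    nl = dot(n, l)
    return [2.0 * ni * nl - li for ni, li in zip(n, l)]


def light_contribution(od, n, l, v, ip, kd, ks, ksc, exponent, fatt=1.0):
    """Diffuse plus specular term: fatt Ip [kd Od (N.L) + ks Os (R.V)^n]."""
    os_ = ksc * od + (1.0 - ksc) * ip          # object's specular colour Os
    nl = max(0.0, dot(n, l))                   # assumption: clamp at 0
    rv = max(0.0, dot(reflect(n, l), v))       # assumption: clamp at 0
    return fatt * ip * (kd * od * nl + ks * os_ * rv ** exponent)


def pixel_value(od, ia, ka, contributions):
    """Ambient term Ia ka Od plus the sum of each light's contribution."""
    return ia * ka * od + sum(contributions)
```

With N = L = V = [0, 0, 1] both dot products are 1, so a single light contributes fatt Ip (kd Od + ks Os).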

Sub-Processes of Illumination Calculation 

In order to calculate diffuse and specular 
contributions, a variety of other calculations are required. 
These are calculations of: 

• 1/√X
• N
• L
• N·L
• R·V
• fatt
• fcp

Sub-processes are also defined for calculating the 
contributions of: 

• ambient 

• diffuse 

• specular 

The sub-processes can then be used to calculate the 
overall illumination of a light source. Since there are only 
4 multiply ALUs, the microcode for a particular type of 
light source can have sub-processes intermingled 
appropriately for performance. 

Calculation of 1/√X

The Vark lighting model uses vectors. In many cases it
is important to calculate the inverse of the length of the
vector for normalization purposes. Calculating the inverse
of the length requires the calculation of 1/SquareRoot[X].

Logically, the process can be represented as a process
with inputs and outputs as shown in Fig. 76. Referring to
Fig. 77, the calculation can be made via a lookup of the
estimation, followed by a single iteration of the following
function:

Vn+1 = ½ Vn (3 - X Vn^2)

The number of iterations depends on the accuracy
required. In this case only 16 bits of precision are
required. The table can therefore have 8 bits of precision,
and only a single iteration is necessary. The following
constant is set by software:

Constant   Value
K1         3

The following lookup table is used:

Lookup   Size                 Details
LU1      256 entries,         1/SquareRoot[X]
         8 bits per entry     Table indexed by the 8 highest
                              significant bits of X.
                              Resultant 8 bits treated as fixed
                              point 0:8
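
The lookup-then-refine scheme can be sketched in Python. The sketch assumes X has already been scaled into the range [1, 4) and that the table is indexed uniformly over that range; both are assumptions for illustration, as are all names. The table entries are rounded to 8 fraction bits, and one iteration of the function given above refines the estimate.

```python
TABLE_BITS = 8


def build_rsqrt_table():
    """256 estimates of 1/sqrt(X) for X in [1, 4), rounded to 8 fraction bits."""
    table = []
    for i in range(1 << TABLE_BITS):
        mid = 1.0 + 3.0 * (i + 0.5) / 256.0    # centre of the i-th interval
        table.append(round(256.0 / mid ** 0.5) / 256.0)
    return table


RSQRT = build_rsqrt_table()


def inverse_sqrt(x):
    """Approximate 1/sqrt(x) for 1 <= x < 4 using one refinement step."""
    index = int((x - 1.0) * 256.0 / 3.0)       # assumed uniform indexing
    v = RSQRT[index]                           # 8-bit table estimate
    return 0.5 * v * (3.0 - x * v * v)         # V(n+1) = 1/2 Vn (3 - X Vn^2)
```

A single iteration roughly doubles the number of correct bits, which is why an 8-bit table is paired with one iteration in the text.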



Overview of Illumination Calculation 
Calculation of N 

N is the surface normal vector. When there is no bump-map,
N is constant. When a bump-map is present, N must be
calculated for each pixel.

No bump-map

When there is no bump-map, there is a fixed normal N that
has the following properties:

N = [Xn, Yn, Zn] = [0, 0, 1]
||N|| = 1
1/||N|| = 1
normalized N = N

These properties can be used instead of specifically
calculating the normal vector and 1/||N|| and thus optimize
other calculations.
With bump-map 

As illustrated in Fig. 78, when a bump-map is present, N
is calculated by comparing bump-map values in X and Y
dimensions. The following diagram shows the calculation of N
for pixel P1 in terms of the pixels in the same row and
column, but not including the value at P1 itself. The
calculation of N is made resolution independent by
multiplying by a scale factor (same scale factor in X & Y).
This process can be represented as a process having inputs
and outputs (Zn is always 1) as illustrated in Fig. 79.

Zn is always 1. Consequently Xn and Yn are not
normalized yet. Normalization of N is delayed
until after calculation of N·L so that there is only 1
multiply by 1/||N|| instead of 3.

An actual process for calculating N is illustrated in
Fig. 80.

The following constant is set by software:

Constant   Value
K1         ScaleFactor (to make N resolution independent)



Calculation of L 
Directional lights

When a light source is infinitely distant, it has an
effective constant light vector L. L is normalized and
calculated by software such that:

L = [Xl, Yl, Zl]
||L|| = 1
1/||L|| = 1

These properties can be used instead of specifically
calculating L and 1/||L|| and thus optimize other
calculations. This process is as illustrated in Fig. 81.
Omni lights and Spotlights

When the light source is not infinitely distant, L is the
vector from the current point P to the light source PL.
Since P = [Xp, Yp, 0], L is given by:

L = [Xl, Yl, Zl]
Xl = Xp - XpL
Yl = Yp - YpL
Zl = -ZpL

We normalize Xl, Yl and Zl by multiplying each by 1/||L||.
The calculation of 1/||L|| (for later use in normalizing) is
accomplished by calculating

V = Xl^2 + Yl^2 + Zl^2

and then calculating V^(-1/2).

In this case, the calculation of L can be represented as
a process with the inputs and outputs as indicated in Fig.
82.

Xp and Yp are the coordinates of the pixel whose
illumination is being calculated. Zp is always 0.

The actual process for calculating L can be as set out
in Fig. 83, where the following constants are set by software:

Constant   Value
K1         XpL
K2         YpL
K3         ZpL^2 (as Zp is 0)
K4         -ZpL
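
A floating-point sketch of the Calculate L process, using the four software constants as read from the table, might look as follows. The function name and the use of a floating-point power in place of the 1/SquareRoot lookup are assumptions; the signs follow the component formulas in the text.

```python
def calculate_l(xp, yp, k1, k2, k3, k4):
    """Return (normalized L, 1/||L||) for the pixel at (xp, yp, 0).

    k1 = XpL, k2 = YpL, k3 = ZpL^2 and k4 = -ZpL.
    """
    xl = xp - k1
    yl = yp - k2
    v = xl * xl + yl * yl + k3        # squared length; Zl^2 is the constant k3
    inv_len = v ** -0.5               # stands in for the 1/SquareRoot lookup
    return (xl * inv_len, yl * inv_len, k4 * inv_len), inv_len
```

Because Zp is always 0, the Z terms depend only on the light position, which is why they can be supplied as the constants K3 and K4.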



Calculation of N·L

Calculating the dot product of vectors N and L is
defined as:

XnXl + YnYl + ZnZl

No bump-map

When there is no bump-map N is a constant [0, 0, 1].
N·L therefore reduces to Zl.

With bump-map

When there is a bump-map, we must calculate the dot
product directly. Rather than take in normalized N
components, we normalize after taking the dot product of a
non-normalized N to a normalized L. L is either normalized
by software (if it is constant), or by the Calculate L
process. This process is as illustrated in Fig. 84.

Note that Zn is not required as input since it is
defined to be 1. However 1/||N|| is required instead, in
order to normalize the result. One actual process for
calculating N·L is as illustrated in Fig. 85.
Calculation of R·V

R·V is required as input to specular contribution
calculations. Since V = [0, 0, 1], only the Z components are
required. R·V therefore reduces to:

R·V = 2Zn(N·L) - Zl

In addition, since the un-normalized Zn = 1, normalized
Zn = 1/||N||

No bump-map

The simplest implementation is when N is constant (i.e.
no bump-map). Since N and V are constant, N·L and R·V can be
simplified:

V = [0, 0, 1]
N = [0, 0, 1]
L = [Xl, Yl, Zl]
N·L = Zl
R·V = 2Zn(N·L) - Zl
    = 2Zl - Zl
    = Zl

When L is constant (Directional light source), a
normalized Zl can be supplied by software in the form of a
constant whenever R·V is required. When L varies (Omni
lights and Spotlights), normalized Zl must be calculated on
the fly. It is obtained as output from the Calculate L
process.

With bump-map

When N is not constant, the process of calculating R·V
is simply an implementation of the generalized formula:

R·V = 2Zn(N·L) - Zl

The inputs and outputs are as shown in Fig. 86 with an
actual implementation as shown in Fig. 87.
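
The delayed normalization of N and the reduced form of R·V can be sketched together. The function names are assumptions; the inputs are the un-normalized Xn and Yn (Zn being 1), 1/||N||, and a normalized L.

```python
def n_dot_l(xn, yn, inv_len_n, l):
    """N.L from un-normalized N = [xn, yn, 1] and normalized L.

    A single multiply by 1/||N|| after the dot product replaces three
    multiplies that would be needed to normalize N first.
    """
    xl, yl, zl = l
    return (xn * xl + yn * yl + zl) * inv_len_n


def r_dot_v(nl, inv_len_n, zl):
    """R.V = 2 Zn (N.L) - Zl, where normalized Zn = 1/||N||."""
    return 2.0 * inv_len_n * nl - zl
```

With a flat surface (xn = yn = 0, 1/||N|| = 1) both functions return Zl, matching the no bump-map simplification.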
Calculation of Attenuation Factor

When a light source is infinitely distant, the
intensity of the light does not vary across the image. The
attenuation factor fatt is therefore 1. This constant can be
used to optimize illumination calculations for infinitely
distant light sources.

Omni lights and Spotlights

When a light source is not infinitely distant, the
intensity of the light can vary according to the following
formula:

fatt = f0 + f1/d + f2/d^2

Appropriate settings of coefficients f0, f1, and f2
allow light intensity to be attenuated by a constant,
linearly with distance, or by the square of the distance.

Since d = ||L||, the calculation of fatt can be
represented as a process with the following inputs and
outputs as illustrated in Fig. 88.

The actual process for calculating fatt can be defined
in Fig. 89.

Where the following constants are set by software:

Constant   Value
K1         f2
K2         f1
K3         f0
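
Since 1/d = 1/||L|| is already available from the Calculate L process, the attenuation polynomial can be evaluated with two multiply/accumulate steps. The following is a sketch under that assumption, with a hypothetical function name; the ordering of operations in Fig. 89 may differ.

```python
def attenuation(inv_d, f0, f1, f2):
    """fatt = f0 + f1/d + f2/d^2, evaluated in nested form on 1/d."""
    return (f2 * inv_d + f1) * inv_d + f0
```

Setting only f0 gives constant intensity, only f1 gives attenuation linear in 1/d, and only f2 gives inverse-square attenuation.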





Directional lights and Omni lights 

These two light sources are not focused, and therefore 
have no cone or penumbra. The cone-penumbra scaling factor 
fcp is therefore 1. This constant can be used to optimize 
illumination calculations for Directional and Omni light
sources.

Spotlights

A spotlight focuses on a particular target point (PT).
The intensity of the Spotlight varies according to whether
the particular point of the image is in the cone, in the
penumbra, or outside the cone/penumbra region.

Turning now to Fig. 90, there is illustrated a graph of
fcp with respect to the penumbra position. Inside the cone
470, fcp is 1, outside 471 the penumbra fcp is 0. From the
edge of the cone through to the end of the penumbra, the
light intensity varies according to a cubic function 472.

The various vectors for penumbra 475 and cone 476
calculation are as illustrated in Fig. 90 and 91.

Looking at the surface of the image in 1 dimension as
shown in Fig. 91, 3 angles A, B, and C are defined. A is the
angle between the target point 479, the light source 478,
and the end of the cone 480. C is the angle between the
target point 479, light source 478, and the end of the
penumbra 481. Both are fixed for a given light source. B is
the angle between the target point 479, the light source
478, and the position being calculated 482, and therefore
changes with every point being calculated on the image.

We normalize the range A to C to be 0 to 1, and find
the distance that B is along that angle range by the
formula:

(B-A) / (C-A)

The range is forced to be in the range 0 to 1 by
truncation, and this value used as a lookup for the cubic
approximation of fcp.

The calculation of fcp can therefore be represented as
a process with the inputs and outputs as illustrated in Fig.
93 with an actual process for calculating fcp as shown in
Fig. 94 where the following constants are set by software:



Constant   Value
K1         Xlt
K2         Ylt
K3         Zlt
K4         A
K5         1/(C-A) [MAXNUM if no penumbra]

The following lookup tables are used:

Lookup   Size                  Details
LU1      64 entries,           Arcos(X)
         16 bits per entry     Units are same as for constants K5 and K6
                               Table indexed by highest 6 bits
                               Result by linear interpolation of 2 entries
                               Timing is 2 * 8 bits * 2 entries = 4 cycles
LU2      64 entries,           Light Response function fcp
         16 bits per entry     F(1) = 0, F(0) = 1, others are according to cubic
                               Table indexed by 6 bits (1:5)
                               Result by linear interpolation of 2 entries
                               Timing is 2 * 8 bits = 4 cycles
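
The cone/penumbra factor can be sketched in Python as below. The text specifies only that the response is cubic with F(0) = 1 and F(1) = 0; the particular cubic used here (1 - (3t^2 - 2t^3)) is an assumption, as are the function names and the direct use of math.acos in place of the 64-entry Arcos table. Angles are in radians in this sketch.

```python
import math


def cone_penumbra_factor(l, lt, a, inv_c_minus_a):
    """Return fcp for a pixel.

    l  : normalized vector between the light and the pixel
    lt : normalized vector between the light and the target (K1..K3)
    a  : cone angle A (K4); inv_c_minus_a : 1/(C-A) (K5)
    """
    cos_b = sum(x * y for x, y in zip(l, lt))
    cos_b = max(-1.0, min(1.0, cos_b))      # guard against rounding overshoot
    b = math.acos(cos_b)                    # angle B
    t = (b - a) * inv_c_minus_a             # position within the penumbra
    t = max(0.0, min(1.0, t))               # forced into the range 0 to 1
    return 1.0 - (3.0 * t * t - 2.0 * t * t * t)   # assumed cubic response
```

A very large value for 1/(C-A), as the table suggests for the case of no penumbra, makes t jump from 0 to 1 as soon as B exceeds A, giving a hard-edged cone.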



Calculation of Ambient Contribution 

Regardless of the number of lights being applied to an
image, the ambient light contribution is performed once for
each pixel, and does not depend on the bump-map.

The ambient calculation process can be represented as a
process with the inputs and outputs as illustrated in
Fig. 95. The implementation of the process requires
multiplying each pixel from the input image (Od) by a
constant value (Ia ka), as shown in Fig. 96 where the following
constant is set by software:

Constant   Value
K1         Ia ka



Calculation of Diffuse Contribution

Each light that is applied to a surface produces a
diffuse illumination. The diffuse illumination is given by
the formula:

diffuse = kd Od (N·L)

There are 2 different implementations to consider:

Implementation 1 - constant N and L

When N and L are both constant (Directional light and
no bump-map):

N·L = Zl

Therefore:

diffuse = kd Od Zl

Since Od is the only variable, the actual process for
calculating the diffuse contribution is as illustrated in
Fig. 97 where the following constant is set by software:

Constant   Value
K1         kd(N·L) = kd Zl



Implementation 2 - non-constant N & L

When either N or L are non-constant (either a bump-map
or illumination from an Omni light or a Spotlight), the
diffuse calculation is performed directly according to the
formula:

diffuse = kd Od (N·L)

The diffuse calculation process can be represented as a
process with the inputs as illustrated in Fig. 98. N·L can
either be calculated using the Calculate N·L Process, or is
provided as a constant. An actual process for calculating
the diffuse contribution is as shown in Fig. 99 where the
following constants are set by software:



Constant 


Value 


Ki 





Calculation of Specular Contribution 

Each light that is applied to a surface produces a
specular illumination. The specular illumination is given by
the formula:

specular = ks Os (R·V)^n

where Os = ksc Od + (1 - ksc) Ip

There are two implementations of the Calculate Specular
process.

Implementation 1 - constant N and L

The first implementation is when both N and L are
constant (Directional light and no bump-map). Since N, L and
V are constant, N·L and R·V are also constant:

V = [0, 0, 1]
N = [0, 0, 1]
N·L = Zl
R·V = 2Zn(N·L) - Zl
    = 2Zl - Zl
    = Zl

The specular calculation can thus be reduced to:

specular = ks Os Zl^n
         = ks Zl^n (ksc Od + (1 - ksc) Ip)
         = ks ksc Zl^n Od + (1 - ksc) Ip ks Zl^n

Since only Od is a variable in the specular
calculation, the calculation of the specular contribution
can therefore be represented as a process with the inputs
and outputs as indicated in Fig. 100 and an actual process
for calculating the specular contribution is shown in
Fig. 101 where the following constants are set by software:

Constant   Value
K1         ks ksc Zl^n
K2         (1 - ksc) Ip ks Zl^n

Implementation 2 - non-constant N and L



This implementation is when either N or L are not
constant (either a bump-map or illumination from an Omni
light or a Spotlight). This implies that R·V must be
supplied, and hence (R·V)^n must also be calculated.

The specular calculation process can be represented as
a process with the inputs and outputs as shown in Fig. 102.
Fig. 103 shows an actual process for calculating the
specular contribution where the following constants are set
by software:



Constant   Value
K1
K2
K3         (1 - ksc) Ip



The following lookup table is used:

Lookup   Size                 Details
LU1      32 entries,          X^n
         16 bits per entry    Table indexed by 5 highest bits of
                              integer R·V
                              Result by linear interpolation of 2
                              entries using fraction of R·V.
                              Interpolation by 2 Multiplies.

The time taken to retrieve the data from the lookup is
2 * 8 bits * 2 entries = 4 cycles.



When ambient light is the only illumination

If the ambient contribution is the only light source,
the process is very straightforward since it is not
necessary to add the ambient light to anything, with the
overall process being as illustrated in Fig. 104. We can
divide the image vertically into 2 sections, and process
each half simultaneously by duplicating the ambient light
logic (thus using a total of 2 Multiply ALUs and 4
Sequential Iterators). The timing is therefore ½ cycle per
pixel for ambient light application.

The typical illumination case is a scene lit by one or
more lights. In these cases, because ambient light
calculation is so cheap, the ambient calculation is included
with the processing of each light source. The first light to
be processed should have the correct Ia ka setting, and
subsequent lights should have an Ia ka value of 0 (to prevent
multiple ambient contributions).

If the ambient light is processed as a separate pass
(and not the first pass), it is necessary to add the ambient
light to the current calculated value (requiring a read and
write to the same address). The process overview is shown in
Fig. 105.

The process uses 3 Image Iterators, 1 Multiply ALU, and
takes 1 cycle per pixel on average.
Infinite Light Source 

In the case of the infinite light source, we have a
constant light source intensity across the image. Thus both
L and fatt are constant.

No Bump Map

When there is no bump-map, there is a constant normal
vector N [0, 0, 1]. The complexity of the illumination is
greatly reduced by the constants of N, L, and fatt. The
process of applying a single Directional light with no
bump-map is as illustrated in Fig. 105 where the following
constant is set by software:



Constant 


Value 


Ki 





For a single infinite light source we want to perform
the logical operations as shown in Fig. 106 where K1 through
K4 are constants with the following values:

Constant   Value
K1         kd(N·L) = kd Lz
K2         ksc
K3         ks(N·H)^n = ks Hz^n
K4         Ip



The process can be simplified since K2, K3, and K4 are
constants. Since the complexity is essentially in the
calculation of the specular and diffuse contributions (using
3 of the Multiply ALUs), it is possible to safely add an
ambient calculation as the 4th Multiply ALU. The first
infinite light source being processed can have the true
ambient light parameter Ia ka, and all subsequent infinite
lights can set Ia ka to be 0. The ambient light calculation
becomes effectively free.

If the infinite light source is the first light being
applied, there is no need to include the existing
contributions made by other light sources and the situation
is as illustrated in Fig. 107 where the constants have the
following values:

Constant   Value
K1         kd(L·N) = kd Lz
K4         Ip
K5         (1 - ks(N·H)^n) Ip = (1 - ks Hz^n) Ip
K6         ksc ks (N·H)^n Ip = ksc ks Hz^n Ip
K7         Ia ka



If the infinite light source is not the first light
being applied, the existing contribution made by previously
processed lights must be included (the same constants apply)
and the situation is as illustrated in Fig. 105.

In the first case 2 Sequential Iterators 490, 491 are
required, and in the second case, 3 Sequential Iterators
490, 491, 492 (the extra Iterator is required to read the
previous light contributions). In both cases, the
application of an infinite light source with no bump map
takes 1 cycle per pixel, including optional application of
the ambient light.

With Bump Map

When there is a bump-map, the normal vector N must be
calculated per pixel and applied to the constant light
source vector L. 1/||N|| is also used to calculate R·V,
which is required as input to the Calculate Specular 2
process. The following constants are set by software:

Constant   Value
K1         Xl
K2         Yl
K3         Zl
K4

Bump-map Sequential Read Iterator 3 is responsible for
reading the current line of the bump-map. It provides the
input for determining the slope in X. Bump-map Sequential
Read Iterators 1 and 2 are responsible for reading the line
above and below the current line. They provide the input for
determining the slope in Y.
Omni Lights 

In the case of the Omni light source, the lighting
vector L and attenuation factor fatt change for each pixel
across an image. Therefore both L and fatt must be calculated
for each pixel.

No Bump Map

When there is no bump-map, there is a constant normal
vector N [0, 0, 1]. Although L must be calculated for each
pixel, both N·L and R·V are simplified to Zl. When there is
no bump-map, the application of an Omni light can be
calculated as shown in Fig. 107 where the following
constants are set by software:

Constant   Value
K1         Xp
K2         Yp
K3         Ip



The algorithm optionally includes the contributions
from previous light sources, and also includes an ambient
light calculation. Ambient light needs only to be included
once. For all other light passes, the appropriate constant
in the Calculate Ambient process should be set to 0.

The algorithm as shown requires a total of 19
multiply/accumulates. With 4 Multiply ALUs the task of
illuminating a single pixel can be accomplished in a minimum
of 5 cycles. The times taken for the lookups are 1 cycle
during the calculation of L, and 4 cycles during the
specular contribution. The processing time of 5 cycles is
therefore the best that can be accomplished. The time taken
is increased to 6 cycles in case it is not possible to
optimally microcode the ALUs for the function. The speed
for applying an Omni light onto an image with no associated
bump-map is 6 cycles per pixel.
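
The lower bound on cycles follows from dividing the multiply/accumulate count among the 4 Multiply ALUs and rounding up. A small Python check (the function name is an assumption):

```python
import math


def min_cycles(multiply_accumulates, multiply_alus=4):
    """Fewest cycles in which the given multiply/accumulates can be issued."""
    return math.ceil(multiply_accumulates / multiply_alus)
```

For the 19 multiply/accumulates of this case the bound is 5 cycles; the same rule gives 8 cycles for counts of 30 or 32.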

With Bump-map

When an Omni light is applied to an image with an associated
bump-map, calculation of N, L, N·L and R·V are all
necessary. The process of applying an Omni light onto an
image with an associated bump-map is as indicated in Fig.
108 where the following constants are set by software:

Constant   Value
K1         Xp
K2         Yp
K3

The algorithm optionally includes the contributions
from previous light sources, and also includes an ambient
light calculation. Ambient light needs only to be included
once. For all other light passes, the appropriate constant
in the Calculate Ambient process should be set to 0.

The algorithm as shown requires a total of 32
multiply/accumulates. With 4 Multiply ALUs the task of
illuminating a single pixel can be accomplished in a minimum
of 8 cycles. The times taken for the lookups are 1 cycle
each during the calculation of both L and N, and 4 cycles
for the specular contribution. However the lookups required
for N and L are both the same (thus 2 LUs implement the 3
LUs). The processing time of 8 cycles is therefore the best
that can be accomplished. The time taken is extended to 9
cycles in case it is not possible to optimally microcode the
ALUs for the function. The speed for applying an Omni light
onto an image with an associated bump-map is 9 cycles per
pixel.

Spotlights

Spotlights are similar to Omni lights except that the
attenuation factor fatt is modified by a cone/penumbra factor
fcp that effectively focuses the light around a target.
No bump-map

When there is no bump-map, there is a constant normal
vector N [0, 0, 1]. Although L must be calculated for each
pixel, both N•L and R•V are simplified to ZL. Fig. 109
illustrates the application of a Spotlight to an image where
the following constants are set by software:



Constant    Value
K1          Xp
K2          Yp
K3          Ip



The algorithm optionally includes the contributions
from previous light sources, and also includes an ambient
light calculation. Ambient light needs only to be included
once. For all other light passes, the appropriate constant
in the Calculate Ambient process should be set to 0.

The algorithm as shown requires a total of 30
multiply/accumulates. With 4 Multiply ALUs the task of
illuminating a single pixel can be accomplished in a minimum
of 8 cycles. The times taken for the lookups are 1 cycle
during the calculation of L, 4 cycles for the specular
contribution, and 2 sets of 4 cycle lookups in the
cone/penumbra calculation. The processing time of 8 cycles
is therefore the best that can be accomplished. The time
taken is extended to 9 cycles in case it is not possible to
optimally microcode the ALUs for the function. The speed
for applying a Spotlight onto an image with no associated
bump-map is 9 cycles per pixel.
With bump-map

When a Spotlight is applied to an image with an
associated bump-map, calculation of N, L, N•L and R•V are
all necessary. The process of applying a single Spotlight
onto an image with an associated bump-map is illustrated in
Fig. 110, where the following constants are set by software:



Constant    Value
K1          Xp
K2          Yp
K3





The algorithm optionally includes the contributions
from previous light sources, and also includes an ambient
light calculation. Ambient light needs only to be included
once. For all other light passes, the appropriate constant
in the Calculate Ambient process should be set to 0.

The algorithm as shown requires a total of 41
multiply/accumulates. With 4 Multiply ALUs the task of
illuminating a single pixel can be accomplished in a minimum
of 11 cycles. The times taken for the lookups are 1 cycle
each during the calculation of both L and N, 4 cycles for
the specular contribution, and 2 sets of 4 cycle lookups in
the cone/penumbra calculation. However the lookups required
for N and L are both the same (thus 4 LUs implement the 5
LUs). The processing time of 11 cycles is therefore the best
that can be accomplished. The time taken is extended to 12
cycles in case it is not possible to optimally microcode the
ALUs for the function. The speed for applying a Spotlight
onto an image with an associated bump-map is 12 cycles per
pixel.
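
The four cycle counts quoted for the Omni light and Spotlight
cases are consistent with dividing the multiply/accumulate
count by the 4 Multiply ALUs and rounding up, then allowing
one further cycle where the ALUs cannot be optimally
microcoded. The following Python sketch only reproduces that
arithmetic. The function and table names are illustrative and
do not appear in the specification, and the lookup cycles are
assumed to fit within the multiply/accumulate time.

```python
import math

MULTIPLY_ALUS = 4  # number of Multiply ALUs stated in the text

# Multiply/accumulate counts quoted in the text, keyed by
# (light type, bump-map present).
MAC_COUNTS = {
    ("omni", False): 19,
    ("omni", True): 32,
    ("spot", False): 30,
    ("spot", True): 41,
}


def min_cycles(macs, alus=MULTIPLY_ALUS):
    """Best case: every ALU is busy on every cycle."""
    return math.ceil(macs / alus)


def budgeted_cycles(macs, alus=MULTIPLY_ALUS):
    """Best case plus one cycle for non-optimal microcode."""
    return min_cycles(macs, alus) + 1


if __name__ == "__main__":
    for (light, bump), macs in MAC_COUNTS.items():
        label = "bump-map" if bump else "no bump-map"
        print(light, label, min_cycles(macs), budgeted_cycles(macs))
```

Under these assumptions the sketch gives 5/6, 8/9, 8/9 and
11/12 cycles per pixel, matching the figures in the text.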

Serial Interfaces
USB serial port interface 52 (Fig. 3)

This is a standard USB serial port, which is connected
to the internal chip low speed bus.
Keyboard interface 55

This is a standard low-speed serial port, which is
connected to the internal chip low speed bus.
Authentication chip serial interfaces 64
These are 2 standard low-speed serial ports, which are
connected to the internal chip low speed bus. The reason for
having 2 ports is to connect to both the on-camera
Authentication chip, and to the print-roll Authentication
chip using separate lines. Only using 1 line may make it
possible for a clone print-roll manufacturer to design a
chip which, instead of generating an authentication code,
tricks the camera into using the code generated by the
authentication chip in the camera.
Parallel Interface 65

The parallel interface connects the ACP to individual static
electrical signals. The following is a table of connections
to the parallel interface:

Connection                        Direction   Pins
Paper transport stepper motor     Output      4
Artcard stepper motor             Output      4
Zoom stepper motor                Output      4
Guillotine solenoid               Output      1
Flash trigger                     Output      1
Status LCD segment drivers        Output      7
Status LCD common drivers         Output      4
Artcard illumination LED          Output      1
Artcard status LED (red/green)    Input       2
Artcard sensor                    Input       1
Paper pull sensor                 Input       1
Orientation sensor                Input       2
Buttons                           Input       4
Total                                         36



Print Head Interface 62

Once an image has been processed, it can be printed.
The Print Head Interface connects the ACP to the Print Head,
providing both data and appropriate signals to the external
Print Head.
Print Head 44

Fig. 111 illustrates the logical layout of a single
Print Head which logically consists of 8 segments, each
printing bi-level cyan, magenta, and yellow onto a portion
of the page.

Loading a segment for printing

Before anything can be printed, each of the 8 segments
in the Print Head must be loaded with 6 rows of data
corresponding to the following relative rows in the final
output image:

Row 0 = Line N, Yellow, even dots 0, 2, 4, 6, 8, ...
Row 1 = Line N+8, Yellow, odd dots 1, 3, 5, 7, ...
Row 2 = Line N+10, Magenta, even dots 0, 2, 4, 6, 8, ...
Row 3 = Line N+18, Magenta, odd dots 1, 3, 5, 7, ...
Row 4 = Line N+20, Cyan, even dots 0, 2, 4, 6, 8, ...
Row 5 = Line N+28, Cyan, odd dots 1, 3, 5, 7, ...

Each of the segments prints dots over different parts
of the page. Each segment prints 750 dots of one color, 375
even dots on one row, and 375 odd dots on another. The 8
segments have dots corresponding to positions:



Segment   First dot   Last dot
0         0           749
1         750         1499
2         1500        2249
3         2250        2999
4         3000        3749
5         3750        4499
6         4500        5249
7         5250        5999


Each dot is represented in the Print Head segment by a
single bit. The data must be loaded 1 bit at a time by
placing the data on the segment's BitValue pin, and clocked
in to a shift register in the segment according to BitClock.
Since the data is loaded into a shift register, the order of
loading bits must be correct. Data can be clocked in to the
Print Head at a maximum rate of 10 MHz.

Once all the bits have been loaded, they must be
transferred in parallel to the Print Head output buffer,
ready for printing. The transfer is accomplished by a single
pulse on the segment's ParallelXferClock pin.
Controlling the Print

In order to conserve power, not all the dots of the
Print Head have to be printed simultaneously. A set of
control lines enables the printing of specific dots. An
external controller, such as the ACP, can change the number
of dots printed at once, as well as the duration of the
print pulse in accordance with speed and/or power
requirements.

Each segment has 5 NozzleSelect lines, which are
decoded to select 32 sets of nozzles per row. Since each row
has 375 nozzles, each set contains 12 nozzles (375 / 32,
rounded up). There are also 2 BankEnable lines, one for each
of the odd and even rows of color. Finally, each segment has
3 ColorEnable lines, one for each of the C, M, and Y colors.
A pulse on one of the ColorEnable lines causes the specified
nozzles of the color's specified rows to be printed. A pulse
is typically about 2 µs in duration.

If all the segments are controlled by the same set of
NozzleSelect, BankEnable and ColorEnable lines (wired
externally to the print head), the following is true:
If both odd and even banks print simultaneously (both
BankEnable bits are set), 24 nozzles fire simultaneously per
segment, 192 nozzles in all, consuming 5.7 Watts.

If odd and even banks print independently, only 12
nozzles fire simultaneously per segment, 96 in all,
consuming 2.85 Watts.
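
The nozzle and power figures above can be checked with a
short calculation. The Python sketch below is illustrative
only. It assumes that power scales linearly with the number of
nozzles firing at once, which is consistent with the two
figures quoted (5.7 Watts for 192 nozzles, 2.85 Watts for 96)
but is not stated as a rule in the text. The function names
are hypothetical.

```python
import math

NOZZLE_SELECT_LINES = 5
NOZZLES_PER_ROW = 375
SEGMENTS = 8
POWER_BOTH_BANKS_W = 5.7  # quoted for 192 simultaneous nozzles

SETS_PER_ROW = 2 ** NOZZLE_SELECT_LINES                    # 32 sets
NOZZLES_PER_SET = math.ceil(NOZZLES_PER_ROW / SETS_PER_ROW)  # 12


def nozzles_firing(banks_enabled):
    """Return (per segment, total) nozzles fired by one pulse."""
    if banks_enabled not in (1, 2):
        raise ValueError("banks_enabled must be 1 or 2")
    per_segment = NOZZLES_PER_SET * banks_enabled
    return per_segment, per_segment * SEGMENTS


def power_watts(banks_enabled):
    """Power under the assumed linear scaling with nozzle count."""
    _, total = nozzles_firing(banks_enabled)
    _, total_both = nozzles_firing(2)
    return POWER_BOTH_BANKS_W * total / total_both


if __name__ == "__main__":
    for banks in (2, 1):
        print(banks, nozzles_firing(banks), round(power_watts(banks), 2))
```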
Print Head Interface 62

The Print Head Interface 62 connects the ACP to the
Print Head, providing both data and appropriate signals to
the external Print Head. The Print Head Interface 62 works
in conjunction with both a VLIW processor 74 and a software
algorithm running on the CPU in order to print a photo in
approximately 2 seconds.

An overview of the inputs and outputs to the Print Head
Interface is shown in Fig. 112. The Address and Data Buses
are used by the CPU to address the various registers in the
Print Head Interface. A single BitClock output line
connects to all 8 segments on the print head. The 8 DataBits
lines lead one to each segment, and are clocked in to the 8
segments on the print head simultaneously (on a BitClock
pulse). For example, dot 0 is transferred to segment 0, dot
750 is transferred to segment 1, dot 1500 to segment 2, etc.
simultaneously.

The VLIW Output FIFO contains the dithered bi-level C,
M, and Y 6000 x 9000 resolution print image in the correct
order for output to the 8 DataBits. The ParallelXferClock
is connected to each of the 8 segments on the print head, so
that on a single pulse, all segments transfer their bits at
the same time. Finally, the NozzleSelect, BankEnable and
ColorEnable lines are connected to each of the 8 segments,
allowing the Print Head Interface to control the duration of
the C, M, and Y drop pulses as well as how many drops are
printed with each pulse. Registers in the Print Head
Interface allow the specification of pulse durations between
0 and 6 µs, with a typical duration of 2 µs.
Printing an Image

There are 2 phases that must occur before an image is
in the hand of the Artcam user:

1. Preparation of the image to be printed

2. Printing the prepared image

Preparation of an image only needs to be performed once.
Printing the image can be performed as many times as
desired.

Prepare the Image
Preparing an image for printing involves:
1. Conversion of the Photo Image into a Print Image

2. Rotation of the Print Image (internal color space), to
align the output for the orientation of the printer

3. Up-interpolation of compressed channels (if necessary)

4. Color conversion from the internal color space to the
CMY color space appropriate to the specific printer and
ink

At the end of image preparation, a 4.5MB correctly
oriented 1000 x 1500 CMY image is ready to be printed.
Convert Photo Image to Print Image

The conversion of a Photo Image into a Print Image
requires the execution of a Vark script to perform image
processing. The script is either a default image enhancement
script or a Vark script taken from the currently inserted
Artcard. The Vark script is executed via the CPU,
accelerated by functions performed by the VLIW Vector
Processor.

Rotate the Print Image 

The image in memory is originally oriented to be top
upwards. This allows for straightforward Vark processing.
Before the image is printed, it must be aligned with the
print roll's orientation. The re-alignment only needs to be
done once. Subsequent Prints of a Print Image will already
have been rotated appropriately.

The transformation to be applied is simply the inverse
of that applied during capture, when the user
pressed the "Image Capture" button on the Artcam. If the
original rotation was 0, then no transformation needs to
take place. If the original rotation was +90 degrees, then
the rotation before printing needs to be -90 degrees (same
as 270 degrees). The method used to apply the rotation is
the Vark accelerated Affine Transform function. The Affine
Transform engine can be called to rotate each color channel
independently. Note that the color channels cannot be
rotated in place. Instead, they can make use of the space
previously used for the expanded single channel (1.5MB).
Fig. 113 shows an example of rotation of a Lab image
where the a and b channels are compressed 4:1. The L channel
is rotated into the space no longer required (the single
channel area), then the a channel can be rotated into the
space left vacant by L, and finally the b channel can be
rotated. The total time to rotate the 3 channels is 0.09
seconds. This is an acceptable period of time to elapse
before the first print of the image. Subsequent prints do
not incur this overhead.
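
The inverse rotation rule described above can be sketched as
follows. This Python fragment is illustrative only and is not
the Vark Affine Transform function: it handles only multiples
of 90 degrees, treats positive angles as counter-clockwise (an
assumption, since the text does not define the direction), and
writes each rotated channel into a new buffer, reflecting the
statement that channels cannot be rotated in place. The
function names are hypothetical.

```python
def print_rotation(capture_rotation_degrees):
    """Inverse of the capture rotation, normalised to 0..359."""
    return (-capture_rotation_degrees) % 360


def rotate_channel(channel, degrees):
    """Rotate a 2D list by a multiple of 90 degrees into a new buffer.

    Positive angles are taken as counter-clockwise (an assumption).
    """
    if degrees % 90 != 0:
        raise ValueError("only multiples of 90 degrees are handled")
    result = [list(row) for row in channel]
    for _ in range((degrees % 360) // 90):
        # One counter-clockwise quarter turn: transpose, then
        # reverse the order of the rows.
        result = [list(row) for row in zip(*result)][::-1]
    return result


if __name__ == "__main__":
    captured = [[1, 2, 3],
                [4, 5, 6]]
    rotated = rotate_channel(captured, 90)
    restored = rotate_channel(rotated, print_rotation(90))
    print(print_rotation(90), restored == captured)
```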

Up Interpolate and color convert

The Lab image must be converted to CMY for printing.
Different processing occurs depending on whether the a and b
channels of the Lab image are compressed. If the Lab image
is compressed, the a and b channels must be decompressed
before the color conversion occurs. If the Lab image is not
compressed, the color conversion is the only necessary step.
The Lab image must be up interpolated (if the a and b
channels are compressed) and converted into a CMY image. A
single VLIW process combining scale and color transform can
be used.

The method used to perform the color conversion is the
Vark accelerated Color Convert function. The Affine
Transform engine can be called to rotate each color channel
independently. The color channels cannot be rotated in
place. Instead, they can make use of the space previously
used for the expanded single channel (1.5MB).
Print the Image

Printing an image is concerned with taking a correctly
oriented 1000 x 1500 CMY image, and generating data and
signals to be sent to the external Print Head. The process
involves the CPU working in conjunction with a VLIW process
and the Print Head Interface.

The resolution of the image in the Artcam is 1000 x
1500. The printed image has a resolution of 6000 x 9000
dots, which makes for a very straightforward relationship: 1
pixel = 6 x 6 = 36 dots. Since each dot is 16.6 µm, the 6 x 6
dot square is 100 µm square. Since each of the dots is
bi-level, the output must be dithered.

The image should be printed in approximately 2 seconds.
For 9000 rows of dots this implies a time of 222 µs
between printing each row. The Print Head Interface must
generate the 6000 dots in this time, an average of 37 ns per
dot. However, each dot comprises 3 colors, so the Print Head
Interface must generate each color component in
approximately 12 ns, or 1 clock cycle of the ACP (10 ns at 100
MHz). One VLIW process is responsible for calculating the
next line of 6000 dots to be printed. The odd and even C,
M, and Y dots are generated by dithering input from 6
different 1000 x 1500 CMY image lines. The second VLIW
process is responsible for taking the previously calculated
line of 6000 dots, and correctly generating the 8 bits of
data for the 8 segments to be transferred by the Print Head
Interface to the Print Head in a single transfer. A CPU 
process updates registers in the first VLIW process 1 line 
at a time, an 2 different VLIW processes in order to 
Generate C, M, and Y Dots

The input to this process is a 1000 x 1500 CMY image
correctly oriented for printing. The image is not compressed
in any way. As illustrated in Fig. 115, a VLIW microcode
program takes the CMY image, and generates the C, M, and Y
pixels required by the Print Head Interface to be dithered.
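
The print timing budget given earlier for the Print the Image
step (a 2 second print, 6000 x 9000 dots, 3 colors per dot)
can be reproduced with simple arithmetic. The Python sketch
below is illustrative only. All input values are taken from
the text, and the function name print_budget is hypothetical.

```python
IMAGE_WIDTH_PX = 1000
IMAGE_HEIGHT_PX = 1500
DOTS_PER_PIXEL_SIDE = 6
PIXEL_SIZE_UM = 100.0    # a 6 x 6 dot square is 100 um square
PRINT_TIME_S = 2.0       # approximate print time
COLOURS = 3              # C, M and Y
ACP_CLOCK_HZ = 100e6     # 100 MHz


def print_budget():
    """Derive the per-row, per-dot and per-colour time budgets."""
    dots_per_row = IMAGE_WIDTH_PX * DOTS_PER_PIXEL_SIDE
    dot_rows = IMAGE_HEIGHT_PX * DOTS_PER_PIXEL_SIDE
    row_time_us = PRINT_TIME_S / dot_rows * 1e6
    dot_time_ns = row_time_us * 1e3 / dots_per_row
    return {
        "dots_per_row": dots_per_row,
        "dot_rows": dot_rows,
        "dot_pitch_um": PIXEL_SIZE_UM / DOTS_PER_PIXEL_SIDE,
        "row_time_us": row_time_us,
        "dot_time_ns": dot_time_ns,
        "colour_time_ns": dot_time_ns / COLOURS,
        "acp_cycle_ns": 1e9 / ACP_CLOCK_HZ,
    }


if __name__ == "__main__":
    for name, value in print_budget().items():
        print(name, round(value, 2))
```

The results (about 222 µs per row, 37 ns per dot and 12 ns per
color component, against a 10 ns ACP clock cycle) agree with
the rounded figures in the text.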
Generate Merged 8 bit Dot Output

This process, as illustrated in Fig. 116, takes a
single line of dithered dots and generates the 8 bit data
stream for output to the Print Head Interface via the VLIW
Output FIFO. The process requires the entire line to have
been prepared, since it requires semi-random access to most
of the dithered line at once.
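
The need for access to most of the line at once follows from
the segment layout: the 8 bits sent on one BitClock pulse come
from dots that are 750 positions apart (dot 0, dot 750, dot
1500 and so on). The Python sketch below illustrates this
interleaving under simplifying assumptions that are not taken
from the specification: bit s of each output byte is assigned
to segment s, a single 6000 dot row of one color is merged,
and the even/odd row split and the row offsets between colors
are ignored. The function name merge_line is hypothetical.

```python
SEGMENTS = 8
DOTS_PER_SEGMENT = 750
DOTS_PER_LINE = SEGMENTS * DOTS_PER_SEGMENT  # 6000


def merge_line(dots):
    """Interleave one line of bi-level dots into 750 bytes.

    Byte i carries, in bit s, the dot at position
    s * 750 + i (assumed bit-to-segment assignment).
    """
    if len(dots) != DOTS_PER_LINE:
        raise ValueError("expected 6000 dots")
    out = bytearray(DOTS_PER_SEGMENT)
    for i in range(DOTS_PER_SEGMENT):
        byte = 0
        for segment in range(SEGMENTS):
            bit = dots[segment * DOTS_PER_SEGMENT + i] & 1
            byte |= bit << segment
        out[i] = byte
    return bytes(out)


if __name__ == "__main__":
    line = [0] * DOTS_PER_LINE
    line[0] = 1      # first dot of segment 0
    line[750] = 1    # first dot of segment 1
    merged = merge_line(line)
    print(len(merged), bin(merged[0]))
```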
Data Card Reader

In Fig. 117, there is illustrated one form of card reader
500 which allows for the insertion of Artcards 9 for
reading. Fig. 118 shows an exploded perspective of the
reader of Fig. 117. Cardreader 500 is interconnected to a
computer system and includes a CCD reading mechanism located
under a covering bar 5. The cardreader 500 includes pinch
rollers 506, 507 for pinching an inserted Artcard 9. One of
the rollers, e.g. 506, is driven by an Artcard motor 37 for
the advancement of the card 9 between the two rollers 506
and 507 at a uniform speed. The Artcard 9 is passed over a
series of LED lights 512 which are encased within a clear
plastic mould 514 having a semi-circular cross section. The
cross section focuses the light from the LEDs, e.g. 512, onto
the surface of the card as it passes by the LEDs 512. From
the surface it is reflected to a high resolution linear CCD
34 which is constructed to a resolution of approximately 4800
dpi. The surface of the Artcard 9 is encoded to the level
of approximately 1600 dpi; hence, the linear CCD 16
supersamples the Artcard surface with an approximately three
times multiplier. The Artcard 9 is further driven at a
speed such that the linear CCD 516 is able to supersample in
the direction of Artcard movement at a rate of approximately
4800 readings per inch. The scanned Artcard CCD data is
forwarded from the Artcard reader 500 to ACP 31 for
processing. A sensor 49, which can comprise a light sensor,
acts to detect the presence of the card 13.

The CCD reader includes a bottom substrate 516 and a top
substrate 514 which comprises a transparent molded plastic.
In between the two substrates is inserted the linear CCD
array 34 which comprises a thin long linear CCD array
constructed by means of semi-conductor manufacturing
processes.

Turning to Fig. 119, there is illustrated a side
perspective view, partly in section, of the CCD reader unit.
The series of LEDs, e.g. 512, are operated to emit light when
a card 9 is passing across the surface of the CCD reader 34.
The emitted light is transmitted through a portion of the
top substrate 523. The substrate includes a portion, e.g.
529, having a curved circumference so as to focus light
emitted from LED 512 to a point, e.g. 532, on the surface of
the card 9. The focussed light is reflected from the point
532 towards the CCD array 34. A series of microlenses, e.g.
534, shown in exaggerated form, are formed on the surface of
the top substrate 523. The microlenses 534 act to focus light
received across the surface down to a point 536 which
corresponds to a point on the surface of the CCD reader 34,
for sensing of light falling on the light sensing portion of
the CCD array 34.

A number of refinements of the above arrangement are
possible. For example, the sensing devices on the linear
CCD 34 may be staggered. The corresponding microlenses 534
can also be correspondingly formed so as to focus light into
a staggered series of spots so as to correspond to the
staggered CCD sensors.

The data surface area of the Artcard 9 is modulated
with a checkerboard pattern as previously discussed with
reference to Fig. 30. Other forms of high frequency
modulation may be possible however.

It will be evident that an Artcard printer can be
provided for the printing out of data on a storage Artcard.
Hence, the Artcard system can be utilized as a general form
of information distribution outside of the Artcam device. An
Artcard printer can print out Artcards on high quality
print surfaces, and multiple Artcards can be printed on the
same sheets and later separated. On a second surface of the
Artcard 9 can be printed information relating to the files
etc. stored on the Artcard 9 for subsequent storage.

Hence, the Artcard system allows for a simplified form
of storage which is suitable for use in place of other forms
of storage such as CD ROMs, magnetic disks etc. The
Artcards 9 can also be mass produced and thereby produced in
a substantially inexpensive form for redistribution.

Turning to Fig. 120, there is illustrated the print
roll 42 and printhead portions of the Artcam. The
paper/film 611 is fed in a continuous "web-like" process
to a printing mechanism 615 which includes further pinch
rollers 616 - 619 and a print head 44.

The pinch roller 613 is connected to a drive mechanism
(not shown) and upon rotation of the print roller 613,
paper 611 is forced through the printing mechanism 615 and
out of the picture output slot 6. A rotary guillotine
mechanism (not shown) is utilised to cut the roll of paper
611 at required photo sizes.

It is therefore evident that the printer roll 42 is
responsible for supplying paper 611 to the print mechanism
615 for printing of photographically imaged pictures.

In Fig. 121, there is shown an exploded perspective
of the print roll 42. The printer roll 10 includes output
printer paper 611 which is output under the operation of
pinching rollers 612, 613.

Referring now to Fig. 122, there is illustrated a
more fully exploded perspective view of the print roll 42
of Fig. 121 without the "paper" film roll. The print roll
42 includes three main parts comprising ink reservoir
section 620, paper roll sections 622, 623 and outer casing
sections 626, 627.

Turning first to the ink reservoir section 620, this
includes the ink reservoir or ink supply sections 633.
The ink for printing is contained within three bladder
type containers 630 - 632. The printer roll 42 is assumed
to provide full colour output inks. Hence, a first ink
reservoir or bladder container 630 contains cyan coloured
ink. A second reservoir 631 contains magenta coloured ink
and a third reservoir 632 contains yellow ink. Each of
the reservoirs 630 - 632, although having different
volumetric dimensions, is designed to have substantially
the same volumetric size.

The ink reservoir sections 621, 633, in addition to
cover 624, can be made of plastic sections and are designed
to be mated together by means of heat sealing, ultra
violet radiation, etc. Each of the equally sized ink
reservoirs 630 - 632 is connected to a corresponding ink
channel 639 - 641 for allowing the flow of ink from the
reservoir 630 - 632 to a corresponding ink output port 635
- 637: the ink reservoir 632 having ink channel 641 and
output port 637, the ink reservoir 631 having ink channel
640 and output port 636, and the ink reservoir 630 having
ink channel 639 and output port 635.

In operation, the ink reservoirs 630 - 632 can be
filled with corresponding ink and the section 633 joined
to the section 621. The ink reservoir sections 630 - 632,
being collapsible bladders, allow for ink to traverse ink
channels 639 - 641 and therefore be in fluid communication
with the ink output ports 635 - 637. Further, if required
an air inlet port can also be provided to allow the
pressure associated with ink channel reservoirs 630 - 632
to be maintained as required.

The cap 624 can be joined to the ink reservoir
section 620 so as to form a pressurised cavity, accessible
by the air pressure inlet port.

The ink reservoir sections 621, 633 and 624 are
designed to be connected together as an integral unit and
to be inserted inside printer roll sections 622, 623. The
printer roll sections 622, 623 are designed to mate
together by means of a snap fit by means of male portions
645 - 647 mating with corresponding female portions (not
shown). Similarly, female portions 654 - ... are designed
to mate with corresponding male portions 660 - 662. The
paper roll sections 622, 623 are therefore designed to be
snapped together. One end of the film within the print
roll is pinched between the two sections 622, 623
when they are joined together. The print film can then be
rolled on the print roll sections 622, 625 as required.

As noted previously, the ink reservoir sections 620,
621, 633, 624 are designed to be inserted inside the paper
roll sections 622, 623. The printer roll sections 622,
623 are able to rotate around the stationary ink
reservoir sections 621, 633 and 624 to dispense film on
demand.

The outer casing sections 626 and 627 are further
designed to be coupled around the print roller sections
622, 623. In addition, each end of the pinch rollers, e.g.
612, 613, is designed to clip in to a corresponding cavity,
e.g. 670, in covers 626, 627, with roller 613 being driven
externally (not shown) to feed the print film out of
the print roll.

Finally, a cavity 677 can be provided in the ink 
reservoir sections 620, 621 for the insertion and gluing 
of a silicon chip integrated circuit type device 79 for
the storage of information associated with the print roll 
42. 

As shown in Fig. 121, the print roll 42 is designed
to be inserted into the Artcam camera device so as to
couple with a coupling unit 680 which includes connector
pads 681 for providing a connection with the silicon chip
53. Further, the connector 680 includes end connectors
for connecting with ink supply ports 635 - 637. The ink
supply ports in turn connect to ink supply lines, e.g.
682, which are in turn interconnected to printhead supply
ports, e.g. 687, for the flow of ink to printhead 44 in
accordance with requirements.

The "media" 611 utilised to form the roll can
comprise many different materials on which it is designed
to print suitable images. For example, opaque rollable
plastic material may be utilized, transparencies may be
used by using transparent plastic sheets, and metallic
printing can take place via utilisation of a metallic
sheet film. Further, fabrics could be utilised within the
printer roll 42 for printing images on fabric, although
care must be taken that only fabrics having a suitable
stiffness or suitable backing material are utilised.

When the print media is plastic, it can be coated
with a layer which fixes and absorbs the ink. Further,
several types of print media may be used, for example,
opaque white matte, opaque white gloss, transparent film,
frosted transparent film, lenticular array film for
stereoscopic 3D prints, metallised film, film with
embossed optical variable devices such as gratings or
holograms, media which is pre-printed on the reverse side,
and media which includes a magnetic recording layer. When
utilising a metallic foil, the metallic foil can have a
polymer base, coated with a thin (several micron)
evaporated layer of aluminium or other metal and then
coated with a clear protective layer adapted to receive
the ink via the ink printer mechanism.

In use the print roll 42 is obviously designed to be 
inserted inside a camera device so as to provide ink and 
paper for the printing of images on demand. The ink 
output ports 635 - 637 meet with corresponding ports 
within the camera device and the pinch rollers 672, 673 
are operated to allow the supply of paper to the camera 
device under the control of the camera device. 

As illustrated in Fig. 122, a mounted silicon chip 53
is inserted in one end of the print roll 42. In Fig. 123 the
authentication chip 53 is shown in more detail and includes
four communications leads 680 - 683 for communicating
details from the chip 53 to the corresponding camera to
which it is inserted.

Turning to Fig. 123, the chip 79 can be created
by means of encasing a small integrated circuit 687 in
epoxy and running bonding leads, e.g. 688, to the external
communications leads 680 - 683. The integrated circuit 687
is approximately 400 microns square with a 100 micron
scribe boundary. Subsequently, the chip 53 can be glued to
an appropriate surface of the cavity of the print roll 42.
In Fig. 124, there is illustrated the integrated circuit 687
interconnected to bonding pads 81, 82 in an exploded view of
the arrangement of Fig. 123.

Referring now to Fig. 125, there is illustrated
generally 700 the internal architecture of the chip 53 of
Fig. 123. The chip architecture 700 includes a flash memory
store 701, a roll authentication unit 702, a command decoder
703 and a communications controller 704.
The communications controller 704 is interconnected to
the serial input and output wires 681, 682 for communication
with the Artcam. The command decoder 703 receives commands
from the camera 30 via the communications controller 704
and controls the flash memory store 701 and roll
authentication unit 702 to carry out the command.
Preferably, the flash memory store 701 provides 1,024 bits
of information which includes fixed data written into the
flash memory at manufacturing time in addition to variable
data storage.

Turning now to Fig. 126, there is illustrated 705 the
information stored within the flash memory store 701. This
data can include the following:
Factory Code

The factory code is a 16 bit code indicating the
factory at which the print roll was manufactured. This
identifies factories belonging to the owner of the print
roll technology, or factories making print rolls under
license. The purpose of this number is to allow the
tracking of the factory that a print roll came from, in case
there are quality problems.

Batch Number

The batch number is a 32 bit code indicating the
manufacturing batch of the print roll. The purpose of this
number is to track the batch that a print roll came from, in
case there are quality problems.

Serial Number

A 48 bit serial number is provided to allow unique
identification of each print roll up to a maximum of 280
trillion print rolls.
Manufacturing date

A 16 bit manufacturing date is included for tracking
the age of print rolls, in case the shelf life is limited.
Media length

The length of print media remaining on the roll is
represented by this number. This length is represented in
small units such as millimetres or the smallest dot pitch of
printer devices using the print roll, to allow the
calculation of the number of remaining photos in each of the
well known C, H, and P formats, as well as other formats
which may be printed. The use of small units also ensures a
high resolution can be used to maintain synchronisation with
pre-printed media.
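
The fixed identification fields listed above (factory code,
batch number, serial number and manufacturing date) occupy 16
+ 32 + 48 + 16 = 112 of the 1,024 bits in the flash memory
store. The Python sketch below shows one way such fields could
be packed into and recovered from a single integer. The field
order, the bit positions and the function names are
assumptions made for illustration only, since the text does
not specify the memory layout.

```python
# (field name, width in bits), widths as stated in the text.
# The ordering is an assumption for illustration.
FIELDS = [
    ("factory_code", 16),
    ("batch_number", 32),
    ("serial_number", 48),
    ("manufacturing_date", 16),
]

TOTAL_BITS = sum(bits for _, bits in FIELDS)  # 112
MAX_SERIAL_NUMBERS = 1 << 48  # number of distinct 48 bit values


def pack_fields(values):
    """Pack the fields into one integer, first field most significant."""
    word = 0
    for name, bits in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << bits):
            raise ValueError(name + " does not fit in its field")
        word = (word << bits) | value
    return word


def unpack_fields(word):
    """Recover the field values from a packed integer."""
    values = {}
    for name, bits in reversed(FIELDS):
        values[name] = word & ((1 << bits) - 1)
        word >>= bits
    return values


if __name__ == "__main__":
    roll = {"factory_code": 7, "batch_number": 1234,
            "serial_number": 99, "manufacturing_date": 365}
    print(TOTAL_BITS, unpack_fields(pack_fields(roll)) == roll)
```

A 48 bit field has 2^48 distinct values, roughly 281 x 10^12,
which is in line with the rounded figure of 280 trillion print
rolls given for the serial number.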
Media Type 

The media type datum enumerates the laedia. contained in 
the print roll. 

( 1 ) Transparent 

(2) Opaque white 

(3) Opaque tinted 

(4) 3D lenticular 

(5) Pre-printed: length specific 

(6) Pre-printed: not length specific 

(7) Metallic foil 

(8) Holographic/optically variable device foil 
Pre-printed Media Length 

The length of the repeat pattern of any pre-printed 
media contained, for example, on the back surface of the 
print roll is stored here. 
Ink Viscosity 

The viscosity of each ink colour is included as an 8 
bit number. The ink viscosity numbers can be used to adjust 
the print head actuator characteristics to compensate for 
viscosity (typically, a higher viscosity will require a 
longer actuator pulse to achieve the same drop volume). 
Recommended Drop Volume for 1200 dpi 

The recommended drop volume of each ink colour is 
included as an 8 bit number. The most appropriate drop 
volume will be dependent upon the ink and print media 
characteristics. For example, the required drop volume will 
decrease with increasing dye concentration or absorptivity. 
Also, transparent media require around twice the drop volume 
as opaque white media, as light only passes through the dye 
layer once for transparent media. 

As the print roll contains both ink and media, a custom 
match can be obtained. The drop volume is only the 
recommended drop volume, as the printer may be other than 
1200 dpi, or the printer may be adjusted for lighter or 
darker printing. 
Ink Colour 

The colour of each of the dye colours is included and 
can be used to "fine tune" the digital half toning that is 
applied to any image before printing. 

Remaining Media Length Indicator 
The length of print media remaining on the roll is 
represented by this number and is updatable by the camera 
device. The length is represented in small units (e.g. 1200 
dpi pixels) to allow calculation of the number of remaining 
photos in each of C, H, and P formats, as well as other 
formats which may be printed. The high resolution can also 
be used to maintain synchronization with pre-printed media. 

Copyright or Bit Pattern 

This 512 bit pattern represents an ASCII character 
sequence sufficient to allow the contents of the flash 
memory store to be copyrightable. 

Authentication Key 

This key includes authentication data to make it 
difficult for third parties to reverse engineer the print 
roll technology. 

Finally, a further 88 bits are reserved for future 
camera use. 
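
As a rough check on the bit budget of the flash memory store, the field widths that the description states explicitly can be tallied. The following sketch is illustrative only: the field names are invented for the example, three ink colours are assumed (matching the three colour print head described later), and the fields whose widths are not stated are treated as a single remainder.

```python
# Field widths in bits, taken from the description where it states them.
# The names are illustrative, not identifiers from the specification.
TOTAL_BITS = 1024
INK_COLOURS = 3  # assumption: cyan, magenta and yellow

stated_widths = {
    "factory_code": 16,
    "batch_number": 32,
    "serial_number": 48,
    "manufacturing_date": 16,
    "ink_viscosity": 8 * INK_COLOURS,
    "recommended_drop_volume": 8 * INK_COLOURS,
    "copyright_bit_pattern": 512,
    "authentication_key": 128,
    "reserved": 88,
}

stated_total = sum(stated_widths.values())
# Bits left for the fields whose widths are not given: media length,
# media type, pre-printed media length, ink colour and the remaining
# media length indicator.
unsized_bits = TOTAL_BITS - stated_total

print(f"stated fields: {stated_total} bits, unsized fields: {unsized_bits} bits")
print(f"serial numbers available: {2 ** stated_widths['serial_number']:,}")
```

Under these assumptions the sized fields leave a modest remainder for the media and colour fields, and the 48 bit serial number gives roughly the 280 trillion values quoted above.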

The roll authentication unit 702, as will become more 
apparent hereinafter, takes the authentication key from 
flash memory store 701 and combines it with a print roll 
test code received from the camera processor. 
Authentication 
The print roll manufacturing process includes in-built 
measures to stop illegal clone manufacturers, in countries 
with weak industrial property protection, from copying the 
technology. 

The print rolls 42 are not protected against cloning by 
high technology barriers, such as the extraordinarily 
difficult chemistry of colour silver halide film in 
photographic reproduction. The present embodiment is simply 
constructed from plastic injection moulding, coated paper, 
and ink. The coated paper and ink may only be required to 
be compatible, and do not need to match some special 
formulation. To protect against these problems, an 
authentication code and circuit is included in the print 
roll chip. 

The authentication system prevents illegal copying, 
which can have the disastrous consequence of ink nozzles 
becoming clogged by poorly filtered ink in "clone" print 
rolls. This will assist in stopping a consumer blaming the 
camera manufacturer and in stopping the spread of 
counterfeit print rolls. 

The authentication system should remain outside most 
countries' legislation in respect of the export of 
cryptographic materials. 

(1) Reverse Engineering of the Print Roll Chip 

The best way to protect against reverse engineering of 
the chip is to make the benefit of reverse engineering 
minimal. To achieve this, the authentication keys are 
stored in non-volatile flash memory store 701, not in ROM. 

(2) Brute force cryptanalysis 

Brute force cryptanalysis can be prevented by making 
the authentication key long enough. To be secure against 
computational improvements over the next fifty years, a long 
key is necessary. A key length of 128 bits means that 2^128 
tests (3.4 x 10^38 tests) must be made to launch a brute force 
attack. This would take ten billion years on an array of a 
trillion processors each running 1 billion tests per second. 
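
The arithmetic behind this estimate can be reproduced in a few lines. The sketch below simply restates the figures given in the text (key length, processor count and test rate); the year length of 365.25 days is an assumption of the example.

```python
KEY_BITS = 128
PROCESSORS = 10 ** 12          # a trillion processors
TESTS_PER_SECOND = 10 ** 9     # a billion tests per second each
SECONDS_PER_YEAR = 365.25 * 24 * 3600

keys = 2 ** KEY_BITS
seconds = keys / (PROCESSORS * TESTS_PER_SECOND)
years = seconds / SECONDS_PER_YEAR

print(f"{keys:.2e} keys, about {years:.2e} years to try them all")
```

The result is on the order of ten billion years, in line with the figure stated above.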
(3) Substitution with a complete lookup table 

If the number of test codes sent by the camera to the 
print roll is small, then there is no need for a clone 
manufacturer to crack the authentication code. Instead, the 
clone manufacturer could incorporate a ROM in their print 
roll which had a record of all of the responses from a 
genuine print roll to the codes sent by the camera. In 30 
years, it may be cost effective to build a terabyte ROM into 
each clone print roll. Therefore, the camera should send 
random authentication test words that are at least 40 bits 
long. A 128 bit authentication test word will certainly be 
more than adequate. 
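
To see why a test word of at least 40 bits defeats a terabyte sized table, the storage needed for a complete lookup table can be estimated. This is a sketch only: it assumes each stored reply is 128 bits wide and that a terabyte is 10^12 bytes, neither of which is stated in the text.

```python
RESPONSE_BYTES = 128 // 8  # assumption: each stored reply is 128 bits
TERABYTE = 10 ** 12        # assumption: decimal terabyte

def table_bytes(test_word_bits: int) -> int:
    """Bytes needed to store a reply for every possible test word."""
    return (2 ** test_word_bits) * RESPONSE_BYTES

for bits in (32, 40, 128):
    size = table_bytes(bits)
    print(f"{bits:3d} bit test word: {size / TERABYTE:.3g} terabytes")
```

Under these assumptions a 32 bit test word fits comfortably within a terabyte, a 40 bit test word already exceeds it, and a 128 bit test word is far beyond any conceivable ROM.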

(4) Substitution with a sparse lookup table 

If the test codes sent by the camera are somehow 
predictable, rather than effectively random, then the clone 
manufacturer need not provide a complete lookup table. For 
example: 

(a) If the test code is simply the serial number of 
the camera, the clone manufacturer need simply provide a 
lookup table which contains values for past and predicted 
future camera serial numbers. There are unlikely to 
be more than 10^9 of these. 

(b) If the test code is simply the date, then the 
clone manufacturer can produce a lookup table using the date 
as the address. 

(c) If the test code is a pseudo-random number using 
either the serial number or the date as a seed, then the 
clone manufacturer just needs to crack the pseudo-random 
number generator in the camera. This is probably not 
difficult, as the clone manufacturer may gain access to the 
object code of the camera. The clone manufacturer could 
then produce a content addressable memory (or other sparse 
array lookup) using these codes to access stored 
authentication codes. 

Therefore, long random test keys should be generated by 
a relatively secure process. This random number generator 
can be in the machine which writes the authentication code 
to the chips. 

(5) Usurping the authentication comparison process 

It must be assumed that a clone manufacturer will have 
access to both the camera and the print roll designs. It 
should not be possible for the clone manufacturer to design 
a chip which, instead of generating an authentication code, 
tricks the camera into using the code generated by the 
duplicate authentication chip in the camera. This can be 
achieved by providing separate serial data Tx and Rx lines 
for the camera and print roll authentication chips. 

(6) Differential Cryptanalysis 

It is important that the system adopted is secure 
against differential cryptanalysis. Differential 
cryptanalysis is a well known technique where pairs of input 
streams are generated with known differences, and the 
differences in the encoded streams are analysed. A small 
amount (10^ or so) of weakening could be accepted. 
(7) Listening to the data flow between the camera and the 
print roll 

Again a logic analyser can be connected to the data 
stream between the camera and the print roll. In this way, 
the codes sent to the print roll, and the authentication 
reply, can be monitored. However, these codes are 128 bit 
pseudo-random numbers, which are only related by the 
encoding algorithm in the authentication chip. This is 
essentially a known plaintext attack, which is less powerful 
than a chosen plaintext attack. 

(8) Direct viewing of chip operation 

If the chip operation could be directly viewed using an 
STM or an electron beam, the authentication codes could be 
recorded as they are read from the internal non-volatile 
memory and loaded into internal registers on the chip. 

(9) Direct viewing of the non-volatile memory 

If the chip were sliced so that the floating gates of 
the Flash memory were exposed, without discharging them, 
then the authentication key could probably be viewed 
directly using an STM. However, slicing the chip to this 
level without discharging the gates is difficult. Using wet 
etching, plasma etching, ion milling, or chemical mechanical 
polishing will almost certainly discharge the small charges 
present on the floating gates of the chip. 

(10) Viewing Idd fluctuations 

Whenever the Authentication key is being read from 
memory, the Message Authentication Code (MAC) circuitry is 
also operating, obscuring the Idd signal. 

Also, after the code words have been programmed, a lock 
bit is programmed which prevents subsequent programming of 
the code words. This prevents detection of the code words 
by monitoring the difference in Idd that may occur when 
programming over either a high or a low bit. 

(11) Bribery and other industrial espionage 

It is not necessary for any human to know, or to be 
able to find out, what the authentication numbers are. 
Therefore, the numbers are safe from bribery or other 
corruption. 

There need only be one or a few machines which program 
the print roll chips, and these machines could be kept 
secure, preventing their theft. Authentication chips may be 
stolen en route to print roll factories, but this would only 
enable the manufacture of as many clone print rolls as there 
were chips stolen, which would probably be a few 
million in any one shipment. It would not be viable for an 
illegal clone print roll manufacturer to continually steal 
chips. 

(12) Reverse engineering the authentication key generator 

If the clone manufacturer can obtain the code for the 
authentication key generator, then this could be reverse 
engineered. For maximum security, the Authentication key 
should be truly random. This is simply achieved by flipping 
a coin 128 times, and entering the key into the 
authentication chip programmer in a secure environment. 
This only has to be done once. 

(13) Management decision to omit authentication to save 
costs 

Without any form of protection, illegal cloning is 
almost certain. However, with the patent and copyright 
protection, the probability of illegal cloning may be 
reduced to, say, 50%. 

However, this is not the only loss possible. If a 
clone manufacturer were to introduce clone print rolls which 
caused damage to the camera (e.g. clogged nozzles), then the 
loss in market acceptance, and the expense of warranty 
repairs, may also be significant. 

Upon insertion of a print 
roll, the ACP 31 interrogates a print roll chip 53 in 
addition to interrogating an exact replica of the chip 54 
stored within the camera system. The print roll chip 53 is 
designed to be fed a print roll test code to which it 
applies a one way hash function to produce a resultant code 
that is checked by the camera processor 105, which also sends 
the same code to its camera authorisation chip 106. 

Turning now to Fig. 127, there are illustrated the 
significant steps in the authorisation method of the 
preferred embodiment. Each Artcam is provided with a unique 
random identification code 710. The Artcam processor takes 
the identification code 710 and a current time value 711 
from the real time clock of the Artcam processor and 
exclusive ORs them together 712. The result of this process 
is utilised as a seed to a random number generator 714 which 
produces a print roll test code having 128 bits. The Artcam 
processor then transmits the print roll test code to the 
Artcam authorisation chip 54 and the print roll 
authorisation chip 53, which each utilise their internally 
stored key via a corresponding roll authentication unit 702 
(Fig. 125) to return to the Artcam processor 31 at stage 719 
the expected output values for the given input value. The 
Artcam processor checks the two values to ensure they are the 
same and accepts or rejects the print roll based on the 
equality of the two values. 
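
The exchange just described can be sketched in software. This is a minimal illustration under stated assumptions, not the circuit of the embodiment: the function names are invented, HMAC-SHA256 truncated to 128 bits stands in for the unspecified one way hash of the roll authentication unit, and Python's general purpose `random.Random` stands in for random number generator 714 (it is not cryptographically secure).

```python
import hashlib
import hmac
import random

def make_test_code(camera_id: int, clock_value: int) -> bytes:
    """XOR the camera identification code with the clock value and use the
    result to seed a generator that yields a 128 bit test code."""
    seed = camera_id ^ clock_value
    return random.Random(seed).getrandbits(128).to_bytes(16, "big")

def chip_response(key: bytes, test_code: bytes) -> bytes:
    """Stand-in for a roll authentication unit: a keyed one way hash.
    HMAC-SHA256 truncated to 128 bits is an assumption of this sketch."""
    return hmac.new(key, test_code, hashlib.sha256).digest()[:16]

def accept_print_roll(camera_key: bytes, roll_key: bytes,
                      camera_id: int, clock_value: int) -> bool:
    """Send the same test code to both chips and compare their replies."""
    test_code = make_test_code(camera_id, clock_value)
    from_camera_chip = chip_response(camera_key, test_code)
    from_roll_chip = chip_response(roll_key, test_code)
    return hmac.compare_digest(from_camera_chip, from_roll_chip)

genuine_key = bytes(range(16))   # placeholder 128 bit key
clone_key = bytes(16)            # a clone holding a different key
print(accept_print_roll(genuine_key, genuine_key, 0x1234, 987654321))
print(accept_print_roll(genuine_key, clone_key, 0x1234, 987654321))
```

The camera never needs to read the key out of either chip; it only compares the two replies, so a roll holding a different key is rejected.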

It will be evident from the foregoing that it is crucial 
that the key utilised by the Artcam authorisation chip 54 
and print roll authorisation chip 53 is kept secret. As 
previously noted, the authorization key is stored in the 
flash memory store 709 of the print roll authorisation 
chip. Therefore, an attack by way of reverse engineering 
of the chip will lead to minimal results. One form of 
attack may be to monitor the chip's operation utilising a 
scanning tunnelling microscope (STM) or an electron beam 
to monitor the authorisation codes as they are read from 
the internal non-volatile memory and loaded into internal 
registers on the chip. Turning now to Fig. 128, such 
analysis can be circumvented by incorporating a shielding 
metal layer 725, over the lower circuitry, as an extra 
metallisation layer. 

Of course, the attacker may simply choose to wet etch 
the metal layer 725. However, if the metal layer 725 is 
utilised as the ground plane for connections within the 
chip circuitry, the metallisation layer, if removed, will 
result in the chip malfunctioning, thereby 
preventing reverse analysis. This means the attacker is 
forced either to remove the metal layer and lay new ground 
connections or to mask the metal layer before removal. 
Masking of the metal layer for removal is the easiest of 
these two processes but will still be very difficult. In 
this case, the attacker must: 

(1) reverse engineer the chip to find out where the 
ground connections should be; 

(2) create a mask corresponding to the required ground 
plane pattern connection; 

(3) apply a photo resist to the chip. This will be 
extremely difficult as the individual chip is only 
approximately 400 microns square. Therefore, 
standard semi-conductor processes of applying a photo 
resist, in particular resist spin processing, cannot 
be utilised; 

(4) soft bake the resist; 

(5) expose the resist. This will again be difficult for 
a single chip as modern lithographic equipment is 
designed for a wafer; 

(6) hard bake the resist; and 

(7) etch the top metallisation layer. 

The process of high temperature resist baking will 
most likely destroy the charge patterns in the non-volatile 
memory which holds the authentication numbers, 
making this process fruitless. 

Further, viewing the data flow in the chip can be 
made even more difficult by making all the connections 
from which it is possible to view the authentication numbers 
in the polysilicon layer. 

The authentication key should be truly random, to 
prevent compromise by obtaining knowledge of the process 
used to generate the authentication key. A simple way is 
for a trusted human to flip a coin 128 times, while 
entering 0 (heads) or 1 (tails) into the keyboard in a 
secure environment. The authentication key need only be 
known by the machine which programs the authentication 
chips (the human coin flipper will not remember it). So 
that this machine cannot be stolen, all authentication 
numbers and chips should be programmed and 
shipped to different print roll and Artcam manufacturing 
sites. Other data specific to an Artcam or print roll can 
be programmed at this place of manufacture. 
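
The coin flipping procedure maps directly onto a 128 bit key. The sketch below shows one possible encoding, assuming the flips are recorded as a string of 'H' and 'T' characters and packed most significant bit first; the function name and the recording format are hypothetical.

```python
def key_from_flips(flips: str) -> bytes:
    """Convert 128 recorded flips ('H' or 'T') into a 16 byte key,
    entering 0 for heads and 1 for tails as in the description."""
    if len(flips) != 128 or set(flips) - {"H", "T"}:
        raise ValueError("expected exactly 128 flips of 'H' or 'T'")
    bits = "".join("0" if f == "H" else "1" for f in flips)
    return int(bits, 2).to_bytes(16, "big")

# Hypothetical record of flips, for illustration only (not random).
example = "HT" * 64
print(key_from_flips(example).hex())
```

A real key would of course come from genuine coin flips entered in the secure environment, never from a fixed pattern such as the one used here.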

Of course, it is necessary to ensure that the 
authentication key is never lost, as this would prevent 
the legitimate manufacture of compatible print rolls. 
Further, the bit pattern preferably contains clearly 
copyrightable material such that the attacker, in order to 
replicate the operation of the authorisation chip 53, must 
also copy the bit pattern and therefore is likely to 
infringe copyright laws in various jurisdictions. The bit 
pattern is preferably the original work of an identifiable 
author reduced to a tangible form. For example, it could 
be a particular image of bits, otherwise it could be a 
corresponding ASCII equivalent of prose. Further, it 
should represent the application of knowledge, judgement, 
skill and labour by the author. It should not be an 
integral part of the chip but merely stored in the chip's 
memory. Of course, preferably, the copyright ownership of 
the bit pattern resides with the print roll manufacturer. 
As an example, the bit pattern could be an ASCII 
representation of a short poem. Hence, the allocation of 
512 bits should be sufficient. Although the bit pattern 
could be stored as ROM on the chips, as these chips 
already have flash memory, the smallest chip size may be 
achieved by implementing the bit pattern in the flash 
memory. 
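
For a sense of scale, 512 bits holds 64 ASCII characters at 8 bits each. The following sketch packs a text into such a field; the padding with NUL bytes and the placeholder text are assumptions of the example and do not come from the specification.

```python
FIELD_BYTES = 512 // 8  # 64 bytes

def pack_bit_pattern(text: str) -> bytes:
    """Encode text as ASCII and pad with NUL bytes to fill the field."""
    raw = text.encode("ascii")
    if len(raw) > FIELD_BYTES:
        raise ValueError("text does not fit in 512 bits")
    return raw.ljust(FIELD_BYTES, b"\x00")

pattern = pack_bit_pattern("placeholder verse, not from the specification")
print(len(pattern) * 8)
```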

Turning now to Fig. 129, there is illustrated the 
storage table 730 of the Artcam authorisation chip. The 
table includes manufacturing code, batch number and serial 
number and date which have an identical format to that 
previously described. The table 730 also includes 
information 731 on the print engine within the Artcam 
device. The information stored can include a print engine 
type, the DPI resolution of the printer and a printer 
count of the number of prints produced by the printer 
device. 

Further, an authentication test key 710 is provided 
which can randomly vary from chip to chip and is utilised 
as the Artcam random identification code in the previously 
described algorithm. The 128 bit print roll authentication 
key 713 is also provided and is equivalent to the key 
stored within the print rolls. Next, the 512 bit pattern 
is stored followed by a 120 bit spare area suitable for 
Artcam use. 
As noted previously, the Artcam preferably includes a 
liquid crystal display 15 which indicates the number of 
prints left on the print roll stored within the Artcam. 
Further, the Artcam also includes a three state switch 17 
which allows a user to switch between three standard 
formats C, H and P (classic, HDTV and panoramic). Upon 
switching between the three states, the liquid crystal 
display 15 is updated to reflect the number of images left 
on the print roll if the particular format selected is 
used. 

In order to correctly operate the liquid crystal 
display, the Artcam processor, upon the insertion of a 
print roll and the passing of the authentication test, 
reads from the flash memory store of the print roll 
chip 53 and determines the amount of paper left. Next, 
the value of the output format selection switch 17 is 
determined by the Artcam processor. Dividing the print 
length by the corresponding length of the selected output 
format, the Artcam processor determines the number of 
possible prints and updates the liquid crystal display 15 
with the number of prints left. Upon a user changing the 
output format selection switch 17, the Artcam processor 105 
re-calculates the number of output pictures in accordance 
with that format and again updates the LCD display 15. 
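
The prints-left calculation is a simple integer division. In the sketch below the per-format print lengths and the remaining media length are hypothetical values chosen for illustration; the description does not give these figures, only that the lengths share a common small unit.

```python
# Hypothetical print lengths per format, in the same units as the stored
# media length. The description does not give these figures.
FORMAT_LENGTH = {"C": 150, "H": 178, "P": 300}

def prints_left(remaining_media_length: int, selected_format: str) -> int:
    """Whole prints that fit on the remaining media for the chosen format."""
    return remaining_media_length // FORMAT_LENGTH[selected_format]

remaining = 3000  # value read from the print roll chip (illustrative)
for fmt in ("C", "H", "P"):
    print(fmt, prints_left(remaining, fmt))
```

Re-running the division whenever the format switch changes gives the updated count for the display.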

The storage of process information in the printer 
roll table 705 also allows the Artcam device to take 
advantage of changes in process and print characteristics 
of the print roll. 

In particular, the pulse characteristics applied to 
each nozzle within the print head can be altered to take 
account of changes in the process characteristics. 
Turning now to Fig. 130, the Artcam Processor can be 
adapted to run a software program stored in an ancillary 
memory chip. The software program, a pulse profile 
characteriser 771, is able to read a number of variables 
from the printer roll. These variables include the 
remaining roll media on printer roll 772, the printer 
media type 773, the ink colour viscosity 774, the ink 
colour drop volume 775 and the ink colour 776. Each of 
these variables is read by the pulse profile 
characteriser and a corresponding, most suitable pulse 
profile is determined in accordance with prior trial and 
experiment. The parameters alter the printer pulse 
received by each printer nozzle so as to improve the 
stability of ink output. 
It will be evident that the authorization chip 
includes significant advances in that important and 
valuable information is stored on the printer chip with 
the print roll. This information can include process 
characteristics of the print roll in question in addition 
to information on the type of print roll and the amount of 
paper left in the print roll. Additionally, the print 
roll interface chip can provide valuable authentication 
information and can be constructed in a tamper proof 
manner. Further, a tamper resistant method of utilising 
the chip has been provided. The utilisation of the print 
roll chip also allows a convenient and effective user 
interface to be provided for an immediate output form of 
Artcam device able to output multiple photographic formats 
whilst simultaneously able to provide an indication of the 
number of photographs left in the printing device. 
Print Head Unit 

Turning now to Fig. 131, there is illustrated an 
exploded perspective view, partly in section, of the print 
head unit 615 of Fig. 120. 

The print head unit 615 is based around the printhead 
44 which ejects ink drops on demand on to print media 611 so 
as to form an image. The print media 611 is pinched between 
two sets of rollers comprising a first set 618, 616 and 
second set 617, 619. 

The printhead 44 operates under the control of power, 
ground and signal lines 810 which provide power and control 
for the printhead 44 and are bonded by means of Tape 
Automated Bonding (TAB) to the surface of the 
printhead 44. 

Importantly, the printhead 44, which can be constructed 
from a silicon wafer device suitably separated, relies upon 
a series of anisotropic etches 812 through the wafer having 
near vertical side walls. The through wafer etches 812 
allow for the direct supply of ink to the printhead surface 
from the back of the wafer for subsequent ejection. 

The ink is supplied to the back of the inkjet printhead 
44 by means of inkhead supply unit 814. The inkjet 
printhead 44 has three separate rows along its surface for 
the supply of separate colours of ink. The inkhead supply 
unit 814 also includes a lid 815 for the sealing of ink 
channels. 

In Figs. 132 - 135, there are illustrated various 
perspective views of the inkhead supply unit 814. Each of 
Figs. 132 - 135 illustrates only a portion of the inkhead 
supply unit, which can be constructed of indefinite length, 
the portions being shown so as to provide exemplary details. In 
Fig. 132, there is illustrated a bottom perspective view. 
Fig. 133 illustrates a top perspective view. Fig. 134 
illustrates a close up bottom perspective view, partly in 
section. Fig. 135 illustrates a top side perspective view 
showing details of the ink channels, and Fig. 136 
illustrates a top side perspective view, as does Fig. 137. 

There is considerable cost advantage in forming inkhead 
supply unit 814 from injection moulded plastic instead of, 
say, micromachined silicon. The manufacturing cost of a 
plastic ink channel will be considerably less in volume and 
manufacturing is substantially easier. The design 

illustrated in the accompanying drawings assumes a 1600 dpi 
three color monolithic print head, of a predetermined 
length. The provided flow rate calculations are for a 100mm 
photo printer. 

The inkhead supply unit 814 contains all of the 
required fine details. The lid 815 (Fig. 131) is 
permanently glued or ultrasonically welded to the inkhead 
supply unit 814 and provides a seal for the ink channels. 

Turning to Fig. 132, the cyan, magenta and yellow ink 
flows in through ink inlets 820-822. The magenta ink flows 
through the throughholes 824, 825 and along the magenta main 
channels 826, 827 (Fig. 133). The cyan ink flows along cyan 
main channel 830 and the yellow ink flows along the yellow 
main channel 831. As best seen from Fig. 134, the cyan ink 
in the cyan main channel then flows into a cyan subchannel 
833. The yellow subchannel 834 similarly receives yellow 
ink from the yellow main channel 831. 

As best seen in Fig. 135, the magenta ink also flows 
from magenta main channels 826,827 through magenta 
throughholes 836, 837. Returning again to Fig. 134, the 
magenta ink flows out of the throughholes 836, 837. The 
magenta ink flows along first magenta subchannel e.g. 838 
and then along second magenta subchannel e.g. 839 before 
flowing into a magenta trough 840. The magenta ink then 
flows through magenta vias e.g. 842 which are aligned with 
corresponding Inkjet head throughholes (e.g. 812 of Fig. 
131) wherein they subsequently supply ink to Inkjet nozzles 
for printing out. 

Similarly, the cyan ink within the cyan subchannel 833 
flows into a cyan pit area 849 which supplies ink to cyan 
vias 843, 844. Similarly, the yellow subchannel 834 
supplies yellow pit area 846 which in turn supplies yellow 
vias 847, 848. 

As seen in Fig. 135, the printhead is designed to be 
received within printhead slot 850 with the various vias 
e.g. 851 aligned with corresponding through holes eg. 851 in 
the printhead wafer. 

Returning to Fig. 131, care must be taken to provide 
adequate ink flow to the entire printhead chip 44, while 
satisfying the constraints of an injection molding process. 
The size of the ink through wafer holes 812 at the back of 
the print head chip is approximately 100 µm x 50 µm, and the 
spacing between through holes carrying different colors of 
ink is approximately 170 µm. While features of this size can 
readily be moulded in plastic (compact discs have micron 
sized features), ideally the wall height must not exceed a 
few times the wall thickness so as to maintain adequate 
stiffness. The preferred embodiment overcomes these 
problems by using a hierarchy of progressively smaller ink 
channels. 

In Fig. 136, there is illustrated a wire frame view of 
a small portion 870 of the surface of the printhead 44. The 
surface is divided into 3 series of nozzles comprising the 
cyan series 871, the magenta series 872 and the yellow 
series 873. Each series of nozzles is further divided into 
two rows e.g. 875, 876, with the printhead 44 having a series 
of bond pads 878 for bonding of power and control signals. 

The print head is preferably constructed in accordance 
with a large number of different forms of ink jet invented 
for uses including Artcam devices. A full list of the 
different invented ink jet types is as set out in the 
associated Australian Provisional Patent Applications as 
set out in Appendix A attached hereto, the applications being 
filed concurrently herewith. In particular, the present 
embodiment assumes the ink jet as set out in the associated 
Australian Provisional Patent Application entitled "Image 
Creation Method and Apparatus (IJ30)" has been utilised. 

The printhead nozzles include the ink supply channels 
880, equivalent to anisotropic etch hole 812 of Fig. 131. 
The ink flows from the back of the wafer through supply 
channel 881 and in turn through the filter grill 882 to ink 
nozzle chambers eg. 883. The operation of the nozzle 
chamber 883 and printhead 44 (Fig. 1) is, as mentioned 
previously, described in the abovementioned patent 
specification. 

Ink Channel Fluid Flow Analysis 

Turning now to an analysis of the ink flow, the main 
ink channels 826, 827, 830, 831 (Fig. 132, Fig. 133) are 
around 1 mm x 1 mm, and supply all of the nozzles of one 
color. The subchannels 833, 834, 838, 839 (Fig. 134) are 
around 200 µm x 100 µm and supply about 25 inkjet nozzles 
each. The print head through holes 843, 844, 847, 848 and 
wafer through holes e.g. 881 (Fig. 136) are 100 µm x 50 µm and 
supply 3 nozzles at each side of the print head through 
holes. Each nozzle filter 882 has 8 slits, each with an 
area of 20 µm x 2 µm, and supplies a single nozzle. 

An analysis has been conducted of the pressure 
requirements of an ink jet printer constructed as described. 
The analysis is for a 1,600 dpi three color process print 
head for photograph printing. The print width was 100 mm 
which gives 6,250 nozzles for each color, giving a total of 
18,750 nozzles. 

The maximum ink flow rate required in various channels 
for full black printing is important. It determines the 
pressure drop along the ink channels, and therefore whether 
the print head will stay filled by the surface tension 
forces alone, or, if not, the ink pressure that is required 
to keep the print head full. 

To calculate the pressure drop, a drop volume of 2.5 pl 
for 1,600 dpi operation was utilized. While the nozzles 
may be capable of operating at a higher rate, the chosen 
drop repetition rate is 5 kHz, which is suitable to print a 
150 mm long photograph in a little under 2 seconds. Thus, 
the print head, in the extreme case, has 18,750 nozzles, 
all printing a maximum of 5,000 drops per second. This ink 
flow is distributed over the hierarchy of ink channels. 
Each ink channel effectively supplies a fixed number of 
nozzles when all nozzles are printing. 
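
The figures quoted above can be combined to give the total ink flow at full black and the time to print a photograph. The sketch below uses only the values stated in the text; the variable names are illustrative, and it is assumed that one row of dots is printed per drop period.

```python
NOZZLES_PER_COLOUR = 6_250
COLOURS = 3
DROP_VOLUME_PL = 2.5
DROP_RATE_HZ = 5_000
DPI = 1_600
PHOTO_LENGTH_MM = 150

nozzles = NOZZLES_PER_COLOUR * COLOURS
drops_per_second = nozzles * DROP_RATE_HZ
flow_ml_per_s = drops_per_second * DROP_VOLUME_PL * 1e-9  # 1 pl = 1e-9 ml

rows = PHOTO_LENGTH_MM / 25.4 * DPI      # dot rows along the photograph
print_time_s = rows / DROP_RATE_HZ       # assumes one row per drop period

print(f"{nozzles} nozzles, {flow_ml_per_s:.3f} ml/s at full black")
print(f"{print_time_s:.2f} s to print {PHOTO_LENGTH_MM} mm")
```

The print time comes out a little under 2 seconds, consistent with the statement above.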

The pressure drop Δp was calculated according to the 
Darcy-Weisbach formula: 

$$\Delta p = \frac{\rho U^{2} f L}{2D}$$

where ρ is the density of the ink, U is the average 
flow velocity, L is the length, D is the hydraulic diameter, 
and f is a dimensionless friction factor calculated as 
follows: 

$$f = \frac{k}{Re}$$

where Re is the Reynolds number and k is a 
dimensionless friction coefficient dependent upon the cross 
section of the channel. The Reynolds number is calculated as 
follows: 

$$Re = \frac{UD}{\nu}$$

where ν is the kinematic viscosity of the ink. 

For a rectangular cross section, k can be approximated 
by: 

$$k = \frac{64}{\frac{2}{3} + \frac{11b}{24a}\left(2 - \frac{b}{a}\right)}$$

where a is the longest side of the rectangular cross 
section, and b is the shortest side. The hydraulic diameter 
D for a rectangular cross section is given by: 

$$D = \frac{2ab}{a + b}$$

Ink is drawn off the main ink channels at 250 points 
along the length of the channels. The ink velocity falls 
linearly from the start of the channel to zero at the end of 
the channel, so the average flow velocity U is half of the 
maximum flow velocity. Therefore, the pressure drop along 
the main ink channels is half of that calculated using the 
maximum flow velocity. 
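
The pressure drop calculation can be sketched numerically as follows. This is an illustration under stated assumptions rather than the analysis actually performed: the ink is given water-like density and kinematic viscosity, the main channel length is taken as 100 mm, and the rectangular friction coefficient uses the common approximation k = 64 / (2/3 + (11b/24a)(2 - b/a)), which gives about 56.9 for a square duct and tends to 96 for a very flat one.

```python
def friction_coefficient(a: float, b: float) -> float:
    """k for a rectangular duct, a = longest side, b = shortest side."""
    return 64.0 / (2.0 / 3.0 + (11.0 * b) / (24.0 * a) * (2.0 - b / a))

def hydraulic_diameter(a: float, b: float) -> float:
    """D = 2ab / (a + b) for a rectangular cross section."""
    return 2.0 * a * b / (a + b)

def pressure_drop(u, length, a, b, rho=1000.0, nu=1.0e-6):
    """Darcy-Weisbach pressure drop in pascals (SI units throughout).
    rho and nu default to water-like values, an assumption of this sketch."""
    d = hydraulic_diameter(a, b)
    reynolds = u * d / nu
    f = friction_coefficient(a, b) / reynolds
    return rho * u ** 2 * f * length / (2.0 * d)

# Main channel of about 1 mm x 1 mm feeding one colour at full black.
flow = 6_250 * 5_000 * 2.5e-15           # m^3/s, 2.5 pl = 2.5e-15 m^3
u_max = flow / (1e-3 * 1e-3)             # velocity at the channel inlet
u_avg = u_max / 2.0                      # velocity falls linearly to zero
length = 0.1                             # assumed 100 mm channel length

print(f"k = {friction_coefficient(1e-3, 1e-3):.2f}")
print(f"main channel drop ~ {pressure_drop(u_avg, length, 1e-3, 1e-3):.0f} Pa")
```

Because f is inversely proportional to U in this laminar model, the pressure drop is linear in U, which is why using the average velocity halves the result obtained with the maximum velocity.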

Utilizing these formulas, the pressure drops can be 
calculated in accordance with the following tables: 
Table of Ink Channel Dimensions and Pressure Drops 
The total pressure drop from the ink inlet to the 
nozzle is therefore approximately 701 Pa for cyan and yellow, 
and 845 Pa for magenta. This is less than 1% of atmospheric 
pressure. Of course, when the image printed is less than 
full black, the ink flow (and therefore the pressure drop) 
is reduced from these values. 
Making the Mold for the Inkhead Supply Unit 
The ink head supply unit 14 (Fig. 1) has features as 
small as 50 µm, and a length of 106 mm. It is impractical to 
machine the injection molding tools in the conventional 
manner. However, even though the overall shape may be 
complex, there are no complex curves required. The 
injection molding tools can be made using conventional 
milling for the main ink channels and other millimetre scale 
features, with a lithographically fabricated inset for the 
fine features. A LIGA process can be used for the inset. 

A single injection molding tool could readily have 50 
or more cavities. Most of the tool complexity is in the 
inset. 

Turning to Fig. 131, the printing system is 
constructed by molding the ink supply unit 814 and lid 815 
and sealing them together as previously described. 
Subsequently, printhead 44 is placed in its corresponding 
slot 850. Adhesive sealing strips 852, 853 are placed over 
the magenta main channels so as to ensure they are properly 
sealed. The Tape Automated Bonding (TAB) strip 810 is then 
connected to the inkjet printhead 44 with the TAB bonding 
wires running in the cavity 855. As can best be seen from 
Figs. 136 and 137, aperture slots 855 - 862 are provided 
for the snap in insertion of rollers. The slots provide 
for the "clipping in" of the rollers, with a small degree of 
play subsequently being provided for simple rotation of the 
rollers. 

In Figs. 138 - 142, there are illustrated various 
perspective views of the internal portions of a finally 
assembled Artcam device with devices appropriately numbered. 
• Fig. 138 illustrates a top side perspective view of the 
internal portions of an Artcam camera, showing the parts 
flattened out; 

• Fig. 139 illustrates a bottom side perspective view of 
the internal portions of an Artcam camera, showing the 
parts flattened out; 

• Fig. 140 illustrates a first top side perspective view 
of the internal portions of an Artcam camera, showing the 
parts as encased in an Artcam; 

• Fig. 141 illustrates a second top side perspective view 
of the internal portions of an Artcam camera, showing the 
parts as encased in an Artcam; 

• Fig. 142 illustrates a second top side perspective view 
of the internal portions of an Artcam camera, showing the 
parts as encased in an Artcam. 

Postcard Print Rolls 

Turning now to Fig. 151, in the preferred embodiment, 
the output printer paper 11 can, on the side that is not to 
receive the printed image, contain a number of pre-printed 
"postcard" formatted backing portions 885. The postcard 
formatted sections 885 can include prepaid postage "stamps" 
886 which can comprise a printed authorisation from the 
relevant postage authority within whose jurisdiction the 
print roll is to be sold or utilised. By agreement with the 
relevant jurisdictional postal authority, the print rolls 
can be made available having different postages. This is 
especially convenient where overseas travellers are in a 
local jurisdiction and wish to send a number of postcards 
to their home country. Further, an address format portion 
87 is provided for the writing of address dispatch details 
in the usual form of a postcard. Finally, a message area 
887 is provided for the writing of personalised 
information. 

Turning now to Fig. 151 and Fig. 152, the operation of 
the camera device is such that when a series of images 890- 
892 is printed on a first surface of the print roll, the 
corresponding backing surface is that illustrated in Fig. 
153. Hence, as each image, e.g. 890, is printed by the camera, 
the back of the image has a ready made postcard 885 which 
can be immediately despatched at the nearest post office box 
within the jurisdiction. In this way, personalised 
postcards can be created. 

It would be evident that when utilising the postcard 
system as illustrated in Fig. 151 and Fig. 152, only 
predetermined image sizes are possible, as the 
synchronisation between the backing postcard portion 885 and 
the front image 891 must be maintained. This can be 
achieved by utilising the memory portions of the 
authentication chip stored within the print roll to store 
details of the length of each postcard backing format sheet 
885, either by having each postcard the same size or by 
storing each size within the print roll's on-board print 
chip memory. 

The Artcam camera control system can ensure that, when 
utilising a print roll having pre-formatted postcards, 
the print roll is utilised only to print images such that 
each image will be on a postcard boundary. Of course, a 
degree of "play" can be provided by providing border 
regions at the edges of each photograph which can account 
for slight misalignment. 
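
The boundary alignment described above can be sketched in a few lines of Python. This is an illustrative sketch only: the specification does not define how the postcard lengths are encoded in the print roll's chip memory, so the list of lengths in millimetres, the function names and the border width used below are all hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def postcard_boundaries(lengths_mm):
    """Start position of every postcard along the roll, plus the roll end."""
    return [0.0] + list(accumulate(lengths_mm))


def next_print_slot(lengths_mm, position_mm):
    """Find the first postcard that starts at or after position_mm.

    lengths_mm stands in for the per-postcard lengths read from the
    print roll's chip memory (hypothetical representation).
    Returns (start_mm, length_mm), or None if no whole postcard remains.
    """
    bounds = postcard_boundaries(lengths_mm)
    index = bisect_left(bounds, position_mm)
    if index >= len(lengths_mm):
        return None
    return bounds[index], lengths_mm[index]


def printable_length(slot_length_mm, border_mm):
    """Image length left after reserving a border at both edges for 'play'."""
    return slot_length_mm - 2.0 * border_mm


if __name__ == "__main__":
    roll = [150.0, 150.0, 150.0]       # assumed: three equal postcards
    slot = next_print_slot(roll, 10.0)  # roll has advanced 10 mm already
    if slot is not None:
        start, length = slot
        print(start, printable_length(length, 3.0))
```

Storing each length separately, rather than a single fixed size, corresponds to the second option described in the text and allows rolls with mixed postcard formats.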

Turning now to Fig. 153, it will be evident that 
postcard rolls can be pre-purchased by a camera user when 
travelling within a particular jurisdiction where they are 
available. The postcard roll can, on its external surface, 
have printed information including the country of purchase, 
the amount of postage on each postcard, the format of each 
postcard (for example being C, H or P or a combination of 
these image modes), the countries that it is suitable for 
use with and the postage expiry date after which the postage 
is no longer guaranteed to be sufficient. 

Hence, a user of the camera device can produce a 
postcard for dispatch in the mail by utilising their hand 
held camera to point at a relevant scene and taking a 
picture having the image on one surface and the pre-paid 
postcard details on the other. Subsequently, the postcard 
can be addressed and a short message written on the postcard 
before its immediate dispatch in the mail. 

It would be appreciated by a person skilled in the art 
that numerous variations and/or modifications may be made to 
the present invention as shown in the specific embodiment 
without departing from the spirit or scope of the invention 
as broadly described. The present embodiment is, therefore, 
to be considered in all respects to be illustrative and not 
restrictive. 

The present provisional is one of a series of 
Australian Provisional Patent Applications which relate to a 
new form of technology for the production of images. These 
Australian Provisional Patent Applications encompass a broad 
range of fields and as such, the present provisional is best 
viewed in the overall context of the development of this new 
form of technology. Appendix A attached hereto sets out the 
details of each of the series of Australian Provisional 
Patent Applications and, to the extent necessary, the 
associated Australian Provisional Patent Applications are 
hereby incorporated by cross-reference. 

We claim: 

1. A camera system comprising: 

at least one area image sensor for imaging a scene; 

a camera processor means for processing said image 
scene in accordance with a predetermined scene 
transformation requirement; and 

a printer for printing out said processed image scene 
on print media said printer, print media and printing ink 
stored in a single detachable module inside said camera 
system; 

said camera system comprising a portable hand held unit 
for the imaging of scenes by said area image sensor and 
printing said scenes directly out of said camera system via 
said printer. 

2. A camera processor as claimed in claim 1 further 
comprising a print roll for the storage of print media and 
printing ink for utilisation by said printer, said print 
roll being detachable from said camera system. 

3. A camera system as claimed in claim 2 wherein said 
print roll includes an authentication chip containing 
authentication information and said camera processing means 
is adapted to interrogate said authentication chip so as to 
determine the authenticity of said print roll when inserted 
within said camera system. 

4. A camera system as claimed in claim 1 wherein said 
printer comprises a drop on demand ink printer. 

5. A camera system as in claim 1 further comprising a 
guillotine means for the separation of printed photographs. 

6. A camera system as claimed in claim 1 wherein the 
number of area image sensors is at least 2 and said camera 
processor means includes means for deriving a stereoscopic 
image from said area image sensors and said print media 
includes means for stereoscopic imaging of said stereo 
images so as to produce a three dimensional effect. 

Dated this 15th day of July 1997 



Silverbrook Research Pty Ltd 
By their Patent Attorneys 
GRIFFITH HACK 



Abstract 

A camera system comprising: 

at least one area image sensor for imaging a scene; 

a camera processor means for processing said image 
scene in accordance with a predetermined scene 
transformation requirement; and 

a printer for printing out said processed image scene 
on print media said printer, print media and printing ink 
stored in a single detachable module inside said camera 
system; 

said camera system comprising a portable hand held unit 
for the imaging of scenes by said area image sensor and 
printing said scenes directly out of said camera system via 
said printer. 